cybench.datasets package
Submodules
cybench.datasets.alignment module
- cybench.datasets.alignment.add_cutoff_days(df: DataFrame, lead_time: str)
Add a column with cutoff days relative to end of season.
- Parameters:
df (pd.DataFrame) – time series data
lead_time (str) – lead_time option
- Returns:
the same DataFrame with column added
- cybench.datasets.alignment.aggregate_time_series_data(df_ts: DataFrame, aggregate_time_series_to: str)
Aggregate time series data to the specified resolution.
- Parameters:
df_ts (pd.DataFrame) – time series data in daily resolution
aggregate_time_series_to (str) – resolution of aggregated data
- Returns:
pd.DataFrame with aggregated data
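As a rough illustration of what aggregating daily data to a coarser resolution can look like, here is a minimal pandas sketch. It does not call cybench; the weekly rule, the mean as aggregation function, and the column name `tavg` are all assumptions made for the example, not the options accepted by `aggregate_time_series_to`.

```python
import pandas as pd


def aggregate_daily(df_ts: pd.DataFrame, rule: str = "W") -> pd.DataFrame:
    """Aggregate a daily, date-indexed DataFrame by averaging within each period.

    The pandas offset alias in `rule` and the use of the mean are assumptions.
    """
    return df_ts.resample(rule).mean()


# Two full weeks of daily values, starting on a Monday.
days = pd.date_range("2020-01-06", periods=14, freq="D")
df_ts = pd.DataFrame({"tavg": range(1, 15)}, index=days)

weekly = aggregate_daily(df_ts)
print(weekly)
```

Some variables (for example precipitation) are usually summed rather than averaged, so a real implementation would likely choose the aggregation per column.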
- cybench.datasets.alignment.align_inputs_and_labels(df_y: DataFrame, dfs_x: dict) tuple
Align inputs and labels to have common indices (KEY_LOC, KEY_YEAR). NOTE: Input data returned may still contain more (KEY_LOC, KEY_YEAR) entries than label data. This is fine because the index of label data is used to access input data and not the other way round.
- Parameters:
df_y (pd.DataFrame) – target or label data
dfs_x (dict) – key is input source and value is pd.DataFrame
- Returns:
a tuple containing the aligned label data and input data
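The alignment described above can be sketched with plain pandas. This is an illustration under stated assumptions, not the cybench implementation: the index names `loc_id` and `year` stand in for KEY_LOC and KEY_YEAR, the helper names are hypothetical, and every input is assumed to be indexed by (location, year) only.

```python
import pandas as pd

# Hypothetical index names standing in for KEY_LOC and KEY_YEAR.
KEY_LOC, KEY_YEAR = "loc_id", "year"


def align_labels_to_inputs(df_y: pd.DataFrame, dfs_x: dict) -> tuple:
    """Keep only labels whose (loc, year) index is present in every input source."""
    common = df_y.index
    for df_x in dfs_x.values():
        common = common.intersection(df_x.index)
    # Inputs are returned unchanged: they may hold extra (loc, year) entries,
    # which is harmless because lookups always start from the label index.
    return df_y[df_y.index.isin(common)], dfs_x


def make_df(keys, column, values):
    """Build a small DataFrame indexed by (KEY_LOC, KEY_YEAR)."""
    index = pd.MultiIndex.from_tuples(keys, names=[KEY_LOC, KEY_YEAR])
    return pd.DataFrame({column: values}, index=index)


df_y = make_df([("A", 2001), ("A", 2002), ("B", 2001)], "yield", [5.1, 4.8, 6.0])
dfs_x = {
    "soil": make_df([("A", 2001), ("A", 2002), ("C", 2001)], "awc", [0.1, 0.1, 0.2]),
    "meteo": make_df([("A", 2001), ("A", 2002), ("A", 2003)], "tavg", [9.5, 10.1, 9.9]),
}

aligned_y, aligned_x = align_labels_to_inputs(df_y, dfs_x)
print(aligned_y)
```

Here the label for ("B", 2001) is dropped because no input covers it, while the extra input rows for ("C", 2001) and ("A", 2003) are left in place, matching the NOTE above.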
- cybench.datasets.alignment.align_to_crop_season_window(df: DataFrame, crop_season_df: DataFrame)
Align time series data to crop season window (includes lead time and spinup).
- Parameters:
df (pd.DataFrame) – time series data
crop_season_df (pd.DataFrame) – crop season data
lead_time (str) – forecast lead time option
- Returns:
the input DataFrame with data aligned to crop season and trimmed to lead time
- cybench.datasets.alignment.compute_crop_season_window(df, min_year, max_year, lead_time='middle-of-season')
Compute crop season window used for forecasting.
- Parameters:
df (pd.DataFrame) – crop calendar data
min_year (int) – earliest year in target data
max_year (int) – latest year in target data
lead_time (str) – forecast lead time option
- Returns:
the same DataFrame with crop season window information
- cybench.datasets.alignment.interpolate_time_series_data(dfs: list, df_crop_season: DataFrame, max_season_window_length: int)
Add dates covering season window length and interpolate to fill in NAs.
- Parameters:
dfs (list) – time series DataFrames
df_crop_season (pd.DataFrame) – crop season
max_season_window_length (int) – maximum season window length
- Returns:
pd.DataFrame with interpolated data
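One common way to "add dates and interpolate" is to reindex a series onto a complete daily range and fill the gaps linearly. The sketch below shows that idea for a single series; it is an assumption about the general technique, not a description of how cybench derives the date range from `df_crop_season` or `max_season_window_length`, and `fill_daily_gaps` is a hypothetical helper.

```python
import pandas as pd


def fill_daily_gaps(series: pd.Series, start: str, end: str) -> pd.Series:
    """Reindex onto every day in [start, end] and linearly interpolate the NAs."""
    full_range = pd.date_range(start=start, end=end, freq="D")
    # limit_direction="both" also fills leading and trailing gaps.
    return series.reindex(full_range).interpolate(
        method="linear", limit_direction="both"
    )


# Observations exist only on the first and the fifth day.
observed = pd.Series(
    [1.0, 5.0], index=pd.to_datetime(["2020-01-01", "2020-01-05"])
)
filled = fill_daily_gaps(observed, "2020-01-01", "2020-01-05")
print(filled)
```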
- cybench.datasets.alignment.interpolate_time_series_data_items(X: list, max_season_window_length: int)
Add dates covering season window length and interpolate to fill in NAs.
- Parameters:
X (list) – data samples
max_season_window_length (int) – maximum season window length
- Returns:
pd.DataFrame with interpolated data
cybench.datasets.configured module
- cybench.datasets.configured._load_and_preprocess_time_series_data(crop, country_code, ts_input, index_cols, ts_cols, df_crop_cal)
A helper function to load and preprocess time series data.
- Parameters:
crop (str) – crop name
country_code (str) – 2-letter country code
ts_input (str) – time series input (used to name data file)
index_cols (list) – columns used as index
ts_cols (list) – columns with time series variables
df_crop_cal (pd.DataFrame) – crop calendar data
- Returns:
the same DataFrame after preprocessing and aligning to crop season
- cybench.datasets.configured.load_dfs(crop: str, country_code: str) tuple
Load data from CSV files for crop and country. Expects CSV files in PATH_DATA_DIR/<crop>/<country_code>/.
- Parameters:
crop (str) – crop name
country_code (str) – 2-letter country code
- Returns:
a tuple (target DataFrame, dict of input DataFrames)
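The directory layout that `load_dfs` expects can be illustrated with `pathlib`. In this sketch the value of `PATH_DATA_DIR` and the helper names are hypothetical; the actual file names inside the directory are not stated here, so the sketch only lists whatever CSV files are present.

```python
from pathlib import Path

# Hypothetical value; the real PATH_DATA_DIR is defined by the package configuration.
PATH_DATA_DIR = Path("data")


def data_dir_for(crop: str, country_code: str) -> Path:
    """Return PATH_DATA_DIR/<crop>/<country_code>/ for a crop and country."""
    return PATH_DATA_DIR / crop / country_code


def list_csv_files(crop: str, country_code: str) -> list:
    """List CSV file names in the crop/country directory (empty if it is missing)."""
    directory = data_dir_for(crop, country_code)
    return sorted(p.name for p in directory.glob("*.csv"))


print(data_dir_for("maize", "NL"))
print(list_csv_files("maize", "NL"))
```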
- cybench.datasets.configured.load_dfs_crop(crop: str, countries: list = None) dict
Load data for crop and one or more countries. If countries is None, data for all countries in CY-Bench is loaded.
- Parameters:
crop (str) – crop name
countries (list) – list of 2-letter country codes
- Returns:
a tuple (target DataFrame, dict of input DataFrames)
cybench.datasets.dataset module
- class cybench.datasets.dataset.Dataset(crop, data_target: DataFrame = None, data_inputs: dict = None)
Bases: object
- static _empty_df_target() DataFrame
Helper function that creates an empty (but correctly formatted) dataframe for yield statistics
- static _filter_df_on_index(df: DataFrame, keys: list, level: int)
Helper method for filtering a dataframe based on the occurrence of certain values in a specified index
- Parameters:
df – the dataframe that should be filtered
keys – the values on which it should filter
level – the index level in which samples should be filtered
- Returns:
a filtered dataframe
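Filtering a dataframe on one level of its index can be done in pandas with `get_level_values` and `isin`. The following is a minimal sketch of that technique, assuming a two-level (location, year) index; it is not necessarily how `_filter_df_on_index` is written, and the index and column names are illustrative.

```python
import pandas as pd


def filter_on_index(df: pd.DataFrame, keys: list, level: int) -> pd.DataFrame:
    """Keep rows whose index value at `level` occurs in `keys`."""
    mask = df.index.get_level_values(level).isin(keys)
    return df[mask]


index = pd.MultiIndex.from_tuples(
    [("A", 2001), ("A", 2002), ("B", 2001)], names=["loc_id", "year"]
)
df = pd.DataFrame({"yield": [5.1, 4.8, 6.0]}, index=index)

only_2001 = filter_on_index(df, keys=[2001], level=1)  # filter on the year level
only_a = filter_on_index(df, keys=["A"], level=0)      # filter on the location level
print(only_2001)
```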
- _get_feature_data(loc_id: str, year: int) dict
Helper function for obtaining feature data corresponding to some index
- Parameters:
loc_id – location index value
year – year index value
- Returns:
a dict containing all feature data corresponding to the specified index
- static _split_df_on_index(df: DataFrame, split: tuple, level: int)
- static _validate_dfs(df_y: DataFrame, dfs_x: dict) bool
Helper function that implements some checks on whether the input dataframes are correctly formatted
- Parameters:
df_y – dataframe containing yield statistics
dfs_x – dict of data source to dataframes each containing feature data
- Returns:
a bool indicating whether the test has passed
- property crop
- property feature_names: set
Obtain a set containing all feature names
- get_normalization_params(normalization='standard')
Compute normalization parameters for input data.
- Parameters:
normalization – normalization method, default standard or z-score
- Returns:
a dict containing normalization parameters (e.g. mean and std)
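For the standard (z-score) case, the parameters are a mean and a standard deviation per feature. The sketch below computes them with the standard library only. The dict layout, the feature name, and the choice of the population standard deviation are assumptions; the structure returned by `get_normalization_params` may differ.

```python
import statistics


def standard_normalization_params(features: dict) -> dict:
    """Compute per-feature mean and (population) standard deviation."""
    return {
        name: {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}
        for name, values in features.items()
    }


def normalize(value: float, mean: float, std: float) -> float:
    """Apply z-score normalization to a single value."""
    return (value - mean) / std


features = {"tavg": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]}
params = standard_normalization_params(features)
z = normalize(9.0, **params["tavg"])
print(params, z)
```

Parameters of this kind are normally computed on the training split only and then reused for validation and test data.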
- indices() list
- property location_ids: set
Obtain a set containing all location ids occurring in the dataset
- property max_season_window_length: int
- split_on_years(years_split: tuple) tuple
Create two new datasets based on the provided split in years
- Parameters:
years_split – tuple, e.g. ([2012, 2014], [2013, 2015])
- Returns:
two datasets
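The effect of a year-based split can be shown on a plain list of (location, year) indices. This is a sketch of the idea only: `split_indices_on_years` is a hypothetical helper, and the real method returns two Dataset objects rather than lists.

```python
def split_indices_on_years(indices: list, years_split: tuple) -> tuple:
    """Split (loc_id, year) pairs into two lists according to two sets of years."""
    first_years, second_years = (set(years) for years in years_split)
    first = [idx for idx in indices if idx[1] in first_years]
    second = [idx for idx in indices if idx[1] in second_years]
    return first, second


indices = [
    ("A", 2012), ("A", 2013), ("A", 2014), ("A", 2015),
    ("B", 2012), ("B", 2013),
]
train_idx, test_idx = split_indices_on_years(
    indices, ([2012, 2014], [2013, 2015])
)
print(train_idx)
print(test_idx)
```

Splitting by year rather than by random sample keeps all locations of a given year on the same side of the split.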
- targets() array
Obtain a NumPy array of targets or labels
- property years: set
Obtain a set containing all years occurring in the dataset
cybench.datasets.dataset_overview module
cybench.datasets.modified_dataset module
cybench.datasets.torch_dataset module
- class cybench.datasets.torch_dataset.TorchDataset(dataset: Dataset, interpolate_time_series: bool = False, aggregate_time_series_to: str = None, max_season_window_length: int = None)
Bases: Dataset, Dataset
- classmethod cast_to_tensor(sample: dict) dict
Create a sample with all data cast to torch tensors
- Parameters:
sample – the sample to convert
- Returns:
the converted data sample
- classmethod collate_fn(samples: list) dict
Function that takes a list of data samples (as dicts, containing torch tensors) and converts it to a dict of batched torch tensors
- Parameters:
samples – a list of data samples
- Returns:
a dict with batched data
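The batching step can be sketched as turning a list of per-sample dicts into one dict of stacked arrays. To keep the example runnable without PyTorch, NumPy arrays stand in for torch tensors and `np.stack` stands in for `torch.stack`; the keys and the handling of non-array values are assumptions, not the behavior of `collate_fn`.

```python
import numpy as np


def collate(samples: list) -> dict:
    """Convert a list of sample dicts into a dict of batched values."""
    batch = {}
    for key in samples[0]:
        values = [sample[key] for sample in samples]
        if isinstance(values[0], np.ndarray):
            # Stack along a new leading batch dimension.
            batch[key] = np.stack(values)
        else:
            # Keep non-array values (e.g. location ids) as a plain list.
            batch[key] = values
    return batch


samples = [
    {"loc_id": "A", "year": 2001, "tavg": np.zeros(5), "yield": np.array([5.1])},
    {"loc_id": "B", "year": 2001, "tavg": np.ones(5), "yield": np.array([6.0])},
]
batch = collate(samples)
print({key: getattr(value, "shape", value) for key, value in batch.items()})
```

Stacking requires every sample to have time series of the same length, which is one reason interpolation to a common window length is useful before batching.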
- classmethod interpolate_and_aggregate(samples: list, max_season_window_length: int, aggregate_time_series_to: str = None)
Function that takes a list of data samples (as dicts, containing numpy arrays) and interpolates and (optionally) aggregates time series data
- Parameters:
samples – a list of data samples
max_season_window_length – maximum length of time series
aggregate_time_series_to – resolution to aggregate time series to
- Returns:
the same data samples after interpolation and aggregation