cybench.datasets package

Submodules

cybench.datasets.alignment module

cybench.datasets.alignment.add_cutoff_days(df: DataFrame, lead_time: str)

Add a column with cutoff days relative to end of season.

Parameters:
  • df (pd.DataFrame) – time series data

  • lead_time (str) – lead_time option

Returns:

the same DataFrame with column added

cybench.datasets.alignment.aggregate_time_series_data(df_ts: DataFrame, aggregate_time_series_to: str)

Aggregate time series data to the specified resolution.

Parameters:
  • df_ts (pd.DataFrame) – time series data in daily resolution

  • aggregate_time_series_to (str) – resolution of aggregated data

Returns:

pd.DataFrame with aggregated data
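
As a rough illustration of what aggregating daily data to a coarser resolution involves, the sketch below uses plain pandas rather than cybench itself. The column names (`loc_id`, `date`, `tavg`), the 7-day target resolution, and the use of the mean are assumptions made for the example; the accepted values of `aggregate_time_series_to` and the aggregation rules applied by the library are not documented here.

```python
import pandas as pd

# Hypothetical daily series for one location (column names are illustrative).
df_ts = pd.DataFrame(
    {
        "loc_id": ["A"] * 14,
        "date": pd.date_range("2020-01-01", periods=14, freq="D"),
        "tavg": [float(v) for v in range(14)],
    }
)

# Group by location and by 7-day bins of the date column, then average.
df_agg = (
    df_ts.groupby(["loc_id", pd.Grouper(key="date", freq="7D")])["tavg"]
    .mean()
    .reset_index()
)
```

Other variables may call for a different reduction (for example a sum for precipitation), which is why the aggregation is kept explicit in the sketch.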

cybench.datasets.alignment.align_inputs_and_labels(df_y: DataFrame, dfs_x: dict) tuple

Align inputs and labels to have common indices (KEY_LOC, KEY_YEAR). NOTE: Input data returned may still contain more (KEY_LOC, KEY_YEAR) entries than label data. This is fine because the index of label data is used to access input data and not the other way round.

Parameters:
  • df_y (pd.DataFrame) – target or label data

  • dfs_x (dict) – key is input source and value is pd.DataFrame

Returns:

a tuple of the label DataFrame and the dict of input DataFrames, aligned on (KEY_LOC, KEY_YEAR)
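
The idea of restricting labels to the indices shared with every input source can be sketched with plain pandas. This is not the library's implementation: the names `loc_id` and `year` stand in for KEY_LOC and KEY_YEAR, and the two input sources and their columns are invented for the example.

```python
import pandas as pd

# "loc_id" and "year" are assumed stand-ins for KEY_LOC and KEY_YEAR.
index_names = ["loc_id", "year"]

df_y = pd.DataFrame(
    {"loc_id": ["A", "A", "B"], "year": [2019, 2020, 2020], "yield": [5.1, 4.8, 6.0]}
).set_index(index_names)

# Hypothetical input sources, each indexed the same way as the labels.
dfs_x = {
    "weather": pd.DataFrame(
        {
            "loc_id": ["A", "A", "B", "C"],
            "year": [2019, 2020, 2020, 2020],
            "tavg": [11.0, 12.0, 9.5, 10.0],
        }
    ).set_index(index_names),
    "remote_sensing": pd.DataFrame(
        {"loc_id": ["A", "A", "B"], "year": [2019, 2020, 2021], "ndvi": [0.4, 0.5, 0.6]}
    ).set_index(index_names),
}

# Intersect the label index with the index of every input source.
common = df_y.index
for df_x in dfs_x.values():
    common = common.intersection(df_x.index)

# Keep only labels that have data in all sources. Inputs are left untouched,
# so they may still hold extra (loc_id, year) entries, as the note above allows.
df_y_aligned = df_y[df_y.index.isin(common)]
```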

cybench.datasets.alignment.align_to_crop_season_window(df: DataFrame, crop_season_df: DataFrame)

Align time series data to crop season window (includes lead time and spinup).

Parameters:
  • df (pd.DataFrame) – time series data

  • crop_season_df (pd.DataFrame) – crop season data

  • lead_time (str) – forecast lead time option

Returns:

the input DataFrame with data aligned to crop season and trimmed to lead time

cybench.datasets.alignment.compute_crop_season_window(df, min_year, max_year, lead_time='middle-of-season')

Compute crop season window used for forecasting.

Parameters:
  • df (pd.DataFrame) – crop calendar data

  • min_year (int) – earliest year in target data

  • max_year (int) – latest year in target data

  • lead_time (str) – forecast lead time option

Returns:

the same DataFrame with crop season window information

cybench.datasets.alignment.interpolate_time_series_data(dfs: list, df_crop_season: DataFrame, max_season_window_length: int)

Add dates covering season window length and interpolate to fill in NAs.

Parameters:
  • dfs (list) – time series DataFrames

  • df_crop_season (pd.DataFrame) – crop season

  • max_season_window_length (int) – maximum season window length

Returns:

pd.DataFrame with interpolated data
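
A minimal sketch of the "add dates, then interpolate" step, using plain pandas on a single series. The variable name `ndvi`, the dates, and the choice of linear interpolation are assumptions for illustration; the method used by the library is not stated here.

```python
import pandas as pd

# Hypothetical series observed on only some days of the window.
s = pd.Series(
    [1.0, 3.0, 7.0],
    index=pd.to_datetime(["2020-03-01", "2020-03-03", "2020-03-07"]),
    name="ndvi",
)

# Add every date in the window; dates without an observation become NaN.
full_range = pd.date_range("2020-03-01", "2020-03-07", freq="D")
s_full = s.reindex(full_range)

# Fill the gaps by linear interpolation between neighbouring observations.
s_filled = s_full.interpolate(method="linear")
```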

cybench.datasets.alignment.interpolate_time_series_data_items(X: list, max_season_window_length: int)

Add dates covering season window length and interpolate to fill in NAs.

Parameters:
  • X (list) – data samples

  • max_season_window_length (int) – maximum season window length

Returns:

pd.DataFrame with interpolated data

cybench.datasets.configured module

cybench.datasets.configured._load_and_preprocess_time_series_data(crop, country_code, ts_input, index_cols, ts_cols, df_crop_cal)

A helper function to load and preprocess time series data.

Parameters:
  • crop (str) – crop name

  • country_code (str) – 2-letter country code

  • ts_input (str) – time series input (used to name data file)

  • index_cols (list) – columns used as index

  • ts_cols (list) – columns with time series variables

  • df_crop_cal (pd.DataFrame) – crop calendar data

Returns:

the same DataFrame after preprocessing and aligning to crop season

cybench.datasets.configured.load_dfs(crop: str, country_code: str) tuple

Load data from CSV files for crop and country. Expects CSV files in PATH_DATA_DIR/<crop>/<country_code>/.

Parameters:
  • crop (str) – crop name

  • country_code (str) – 2-letter country code

Returns:

a tuple (target DataFrame, dict of input DataFrames)

cybench.datasets.configured.load_dfs_crop(crop: str, countries: list = None) dict

Load data for crop and one or more countries. If countries is None, data for all countries in CY-Bench is loaded.

Parameters:
  • crop (str) – crop name

  • countries (list) – list of 2-letter country codes

Returns:

a tuple (target DataFrame, dict of input DataFrames)

cybench.datasets.dataset module

class cybench.datasets.dataset.Dataset(crop, data_target: DataFrame = None, data_inputs: dict = None)

Bases: object

static _empty_df_target() DataFrame

Helper function that creates an empty (but correctly formatted) dataframe for yield statistics

static _filter_df_on_index(df: DataFrame, keys: list, level: int)

Helper method for filtering a dataframe based on the occurrence of certain values in a specified index

Parameters:
  • df – the dataframe that should be filtered

  • keys – the values on which it should filter

  • level – the index level in which samples should be filtered

Returns:

a filtered dataframe
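
Filtering on one level of a multi-level index can be sketched as follows. The function below is a standalone illustration with the same parameter names, not the library's code, and the index names and data are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"yield": [5.1, 4.8, 6.0, 5.5]},
    index=pd.MultiIndex.from_tuples(
        [("A", 2019), ("A", 2020), ("B", 2019), ("B", 2020)],
        names=["loc_id", "year"],  # assumed index names
    ),
)


def filter_df_on_index(df: pd.DataFrame, keys: list, level: int) -> pd.DataFrame:
    """Keep the rows whose index value at `level` is one of `keys`."""
    mask = df.index.get_level_values(level).isin(keys)
    return df[mask]


# Level 1 is the year level in this example.
df_2020 = filter_df_on_index(df, keys=[2020], level=1)
```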

_get_feature_data(loc_id: str, year: int) dict

Helper function for obtaining feature data corresponding to some index.

Parameters:
  • loc_id – location index value

  • year – year index value

Returns:

a dict containing all feature data corresponding to the specified index

static _split_df_on_index(df: DataFrame, split: tuple, level: int)

static _validate_dfs(df_y: DataFrame, dfs_x: dict) bool

Helper function that implements some checks on whether the input dataframes are correctly formatted

Parameters:
  • df_y – dataframe containing yield statistics

  • dfs_x – dict of data source to dataframes each containing feature data

Returns:

a bool indicating whether the test has passed

property crop

property feature_names: set

Obtain a set containing all feature names

get_normalization_params(normalization='standard')

Compute normalization parameters for input data.

Parameters:

normalization – normalization method; the default is standard (z-score)

Returns:

a dict containing normalization parameters (e.g. mean and std)
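
For reference, standard (z-score) normalization only needs a mean and a standard deviation per feature. The sketch below computes them with NumPy; the feature names and the layout of the `params` dict are assumptions for the example and need not match what the method returns.

```python
import numpy as np

# Hypothetical feature values collected over all samples.
features = {
    "tavg": np.array([10.0, 12.0, 14.0]),
    "prec": np.array([0.0, 2.0, 4.0]),
}

# One mean and one standard deviation per feature (assumed dict layout).
params = {
    name: {"mean": float(values.mean()), "std": float(values.std())}
    for name, values in features.items()
}

# Applying the parameters: subtract the mean, divide by the standard deviation.
tavg_norm = (features["tavg"] - params["tavg"]["mean"]) / params["tavg"]["std"]
```

Computing the parameters on training data only and reusing them for test data keeps information about the test set out of the preprocessing step.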

indices() list

static load(dataset_name: str) Dataset

property location_ids: set

Obtain a set containing all location ids occurring in the dataset

property max_season_window_length: int

split_on_years(years_split: tuple) tuple

Create two new datasets based on the provided split in years

Parameters:

years_split – tuple of two lists of years, e.g. ([2012, 2014], [2013, 2015])

Returns:

two data sets
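
The effect of a split such as `([2012, 2014], [2013, 2015])` can be illustrated on a plain label DataFrame. This is a sketch of the idea only: the index names and values are invented, and the method itself returns two datasets rather than two DataFrames.

```python
import pandas as pd

df_y = pd.DataFrame(
    {"yield": [5.1, 4.8, 6.0, 5.5]},
    index=pd.MultiIndex.from_tuples(
        [("A", 2012), ("A", 2013), ("A", 2014), ("A", 2015)],
        names=["loc_id", "year"],  # assumed index names
    ),
)

years_split = ([2012, 2014], [2013, 2015])

# Select the rows whose year falls in each half of the split.
years = df_y.index.get_level_values("year")
df_first = df_y[years.isin(years_split[0])]
df_second = df_y[years.isin(years_split[1])]
```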

targets() array

Obtain a NumPy array of targets or labels

property years: set

Obtain a set containing all years occurring in the dataset

cybench.datasets.dataset_overview module

cybench.datasets.modified_dataset module

class cybench.datasets.modified_dataset.ModifiedTargetsDataset(dataset: Dataset, modified_targets: DataFrame = None)

Bases: Dataset

cybench.datasets.torch_dataset module

class cybench.datasets.torch_dataset.TorchDataset(dataset: Dataset, interpolate_time_series: bool = False, aggregate_time_series_to: str = None, max_season_window_length: int = None)

Bases: cybench.datasets.dataset.Dataset, torch.utils.data.Dataset

classmethod cast_to_tensor(sample: dict) dict

Create a sample with all data cast to torch tensors.

Parameters:

sample – the sample to convert

Returns:

the converted data sample

classmethod collate_fn(samples: list) dict

Function that takes a list of data samples (as dicts, containing torch tensors) and converts it to a dict of batched torch tensors.

Parameters:

samples – a list of data samples

Returns:

a dict with batched data
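
The batching pattern, a list of per-sample dicts turned into one dict of stacked arrays, can be sketched as follows. NumPy arrays are used here as a stand-in for torch tensors so the example runs without PyTorch; the sample keys and shapes are assumptions, and the helper `collate` is hypothetical rather than the class method itself.

```python
import numpy as np

# Two hypothetical samples with identical keys and shapes.
samples = [
    {"tavg": np.array([10.0, 11.0, 12.0]), "yield": np.array(5.1)},
    {"tavg": np.array([9.0, 9.5, 10.0]), "yield": np.array(4.8)},
]


def collate(samples: list) -> dict:
    """Stack the values under each key along a new leading batch axis."""
    return {key: np.stack([s[key] for s in samples]) for key in samples[0]}


batch = collate(samples)
```

Stacking requires every sample to have the same shape under a given key, which is one reason time series may need to be brought to a common length first.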

classmethod interpolate_and_aggregate(samples: list, max_season_window_length: int, aggregate_time_series_to: str = None)

Function that takes a list of data samples (as dicts, containing numpy arrays) and interpolates and (optionally) aggregates time series data.

Parameters:
  • samples – a list of data samples

  • max_season_window_length – maximum length of time series

  • aggregate_time_series_to – resolution to aggregate time series to

Returns:

the same data samples after interpolation and aggregation

Module contents