cybench.datasets package

Submodules

cybench.datasets.alignment module

cybench.datasets.alignment.add_cutoff_days(df: DataFrame, lead_time: str)

Add a column with cutoff days relative to end of season.

Parameters:

df (pd.DataFrame) – time series data
lead_time (str) – lead_time option

Returns:

the same DataFrame with column added

cybench.datasets.alignment.aggregate_time_series_data(df_ts: DataFrame, aggregate_time_series_to: str)

Aggregate time series data to the specified resolution.

Parameters:

df_ts (pd.DataFrame) – time series data in daily resolution
aggregate_time_series_to (str) – resolution of aggregated data

Returns:

pd.DataFrame with aggregated data
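As an illustration of aggregating daily data to a coarser resolution, here is a minimal sketch using plain pandas. It is not the cybench implementation: the column names (adm_id, year, date, tavg), the 7-day bin width and the use of the mean are all assumptions, and the values accepted by aggregate_time_series_to are not listed in this reference.

```python
import pandas as pd


def aggregate_daily(df_ts, freq="7D", group_cols=("adm_id", "year"), date_col="date"):
    """Average daily values within fixed-length bins for each location and year.

    Hypothetical helper: column names and bin width are assumptions.
    """
    keys = list(group_cols) + [pd.Grouper(key=date_col, freq=freq)]
    # numeric_only=True skips non-numeric columns that are not grouping keys
    return df_ts.groupby(keys).mean(numeric_only=True).reset_index()


# Fourteen days of made-up values for one location and year
daily = pd.DataFrame(
    {
        "adm_id": "NL11",
        "year": 2020,
        "date": pd.date_range("2020-01-01", periods=14, freq="D"),
        "tavg": [float(i) for i in range(14)],
    }
)
weekly = aggregate_daily(daily)
print(weekly)
```

Some variables (precipitation, for example) are usually summed rather than averaged, so the choice of aggregation function would need to be made per column.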

cybench.datasets.alignment.align_inputs_and_labels(df_y: DataFrame, dfs_x: dict) → tuple

Align inputs and labels to have common indices (KEY_LOC, KEY_YEAR). NOTE: Input data returned may still contain more (KEY_LOC, KEY_YEAR) entries than label data. This is fine because the index of label data is used to access input data and not the other way round.

Parameters:

df_y (pd.DataFrame) – target or label data
dfs_x (dict) – key is input source and value is pd.DataFrame

Returns:

a tuple (label DataFrame, dict of input DataFrames) with aligned indices
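The idea of aligning on common (KEY_LOC, KEY_YEAR) pairs can be sketched as follows. This is an illustration under assumptions, not the library code: the index level names "adm_id" and "year" stand in for KEY_LOC and KEY_YEAR, and the helper names are hypothetical. Unlike the documented behaviour, this sketch also drops input entries that have no matching label.

```python
import pandas as pd

# Assumed index level names standing in for KEY_LOC and KEY_YEAR
KEY_LOC, KEY_YEAR = "adm_id", "year"


def _loc_year_pairs(df):
    """Return the (location, year) part of the index, ignoring other levels."""
    return pd.MultiIndex.from_arrays(
        [df.index.get_level_values(KEY_LOC), df.index.get_level_values(KEY_YEAR)]
    )


def align_on_common_index(df_y, dfs_x):
    """Keep only (location, year) pairs found in the labels and in every input."""
    common = set(_loc_year_pairs(df_y))
    for df_x in dfs_x.values():
        common &= set(_loc_year_pairs(df_x))
    keep = list(common)
    df_y = df_y[_loc_year_pairs(df_y).isin(keep)]
    dfs_x = {
        src: df_x[_loc_year_pairs(df_x).isin(keep)] for src, df_x in dfs_x.items()
    }
    return df_y, dfs_x


# Made-up labels and one made-up input source with an extra "date" level
idx_y = pd.MultiIndex.from_tuples(
    [("A", 2001), ("A", 2002), ("B", 2001)], names=[KEY_LOC, KEY_YEAR]
)
df_y = pd.DataFrame({"yield": [1.0, 2.0, 3.0]}, index=idx_y)
idx_x = pd.MultiIndex.from_tuples(
    [
        ("A", 2001, "2001-05-01"),
        ("A", 2001, "2001-05-02"),
        ("B", 2001, "2001-05-01"),
        ("C", 2001, "2001-05-01"),
    ],
    names=[KEY_LOC, KEY_YEAR, "date"],
)
df_x = pd.DataFrame({"tavg": [10.0, 11.0, 12.0, 13.0]}, index=idx_x)

y_aligned, x_aligned = align_on_common_index(df_y, {"meteo": df_x})
print(y_aligned)
print(x_aligned["meteo"])
```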

cybench.datasets.alignment.align_to_crop_season_window(df: DataFrame, crop_season_df: DataFrame)

Align time series data to crop season window (includes lead time and spinup).

Parameters:

df (pd.DataFrame) – time series data
crop_season_df (pd.DataFrame) – crop season data

Returns:

the input DataFrame with data aligned to crop season and trimmed to lead time

cybench.datasets.alignment.align_to_crop_season_window_numpy(locs: ndarray, years: ndarray, dates: ndarray, crop_season_keys: dict, sos_dates: ndarray, eos_dates: ndarray, cutoff_dates: ndarray, season_window_lengths: ndarray)

Aligns time series data using NumPy arrays and full vectorization, with optimized memory.

cybench.datasets.alignment.compute_crop_season_window(df, min_year, max_year, lead_time='middle-of-season')

Compute crop season window used for forecasting.

Parameters:

df (pd.DataFrame) – crop calendar data
min_year (int) – earliest year in target data
max_year (int) – latest year in target data
lead_time (str) – forecast lead time option

Returns:

the same DataFrame with crop season window information

cybench.datasets.alignment.ensure_same_categories_union(df_1, df_2, cat_key='adm_id')

Ensures that cat_key has the same categories in both DataFrames, using the union of unique values.
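A minimal sketch of the union-of-categories idea, using only pandas. It assumes cat_key is a regular column (not an index level) and that sorting the union is acceptable; the function name is hypothetical and the real implementation may differ.

```python
import pandas as pd


def same_categories_union(df_1, df_2, cat_key="adm_id"):
    """Cast cat_key in both frames to one categorical dtype built from the union."""
    union = sorted(set(df_1[cat_key]) | set(df_2[cat_key]))
    dtype = pd.CategoricalDtype(categories=union)
    # assign returns new frames, so the inputs are left unchanged
    df_1 = df_1.assign(**{cat_key: df_1[cat_key].astype(dtype)})
    df_2 = df_2.assign(**{cat_key: df_2[cat_key].astype(dtype)})
    return df_1, df_2


left = pd.DataFrame({"adm_id": ["A", "B"], "value": [1, 2]})
right = pd.DataFrame({"adm_id": ["B", "C"], "value": [3, 4]})
left_cat, right_cat = same_categories_union(left, right)
print(list(left_cat["adm_id"].cat.categories))
```

Sharing one categorical dtype lets later merges or comparisons on cat_key run without pandas falling back to object dtype.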

cybench.datasets.alignment.interpolate_time_series_data(dfs: list, df_crop_season: DataFrame, max_season_window_length: int)

Add dates covering season window length and interpolate to fill in NAs.

Parameters:

dfs (list) – time series DataFrames
df_crop_season (pd.DataFrame) – crop season
max_season_window_length (int) – maximum season window length

Returns:

pd.DataFrame with interpolated data
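The "add dates, then interpolate" step can be sketched for a single location and year as below. This is an assumption-laden illustration: the helper name, the DatetimeIndex layout, the choice to anchor the window at the last available date, and linear interpolation are not confirmed by this reference.

```python
import pandas as pd


def fill_daily_gaps(df, window_length):
    """Reindex to a full daily window ending at the last date, then interpolate."""
    full = pd.date_range(end=df.index.max(), periods=window_length, freq="D")
    # Missing dates become NaN rows, which linear interpolation then fills
    return df.reindex(full).interpolate(method="linear", limit_direction="both")


# Made-up series observed every second day
sparse = pd.DataFrame(
    {"ndvi": [1.0, 3.0, 5.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-05"]),
)
filled = fill_daily_gaps(sparse, window_length=5)
print(filled)
```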

cybench.datasets.alignment.interpolate_time_series_data_items(X: list, max_season_window_length: int)

Add dates covering season window length and interpolate to fill in NAs.

Parameters:

X (list) – data samples
max_season_window_length (int) – maximum season window length

Returns:

pd.DataFrame with interpolated data

cybench.datasets.alignment.process_crop_seasons(locs, years, dates, crop_season_keys, sos_dates, eos_dates, cutoff_dates, season_window_lengths)

Processes crop season data efficiently using NumPy.

cybench.datasets.alignment.restore_category_to_string(df, cat_key='adm_id', original_order=None)

Converts a categorical column back to string, restoring the original order if available.

cybench.datasets.configured module

cybench.datasets.configured._load_and_preprocess_time_series_data(crop, country_code, ts_input, index_cols, ts_cols, df_crop_cal, use_memory_optimization=True, verbose=False)

A helper function to load and preprocess time series data.

Parameters:

crop (str) – crop name
country_code (str) – 2-letter country code
ts_input (str) – time series input (used to name data file)
index_cols (list) – columns used as index
ts_cols (list) – columns with time series variables
df_crop_cal (pd.DataFrame) – crop calendar data
use_memory_optimization (bool) – use (slower) memory-optimized function for crop season alignment
verbose (bool) – output detailed processing information.

Returns:

the same DataFrame after preprocessing and aligning to crop season

cybench.datasets.configured.load_dfs(crop: str, country_code: str, use_memory_optimization: bool = True) → tuple

Load data from CSV files for crop and country. Expects CSV files in PATH_DATA_DIR/<crop>/<country_code>/.

Parameters:

crop (str) – crop name
country_code (str) – 2-letter country code
use_memory_optimization (bool) – use (slower) memory-optimized function for crop season alignment

Returns:

a tuple (target DataFrame, dict of input DataFrames)
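The expected directory layout can be illustrated with a small path helper. The value of PATH_DATA_DIR, the helper name and the crop and country values are placeholders chosen for this sketch, not values taken from the library.

```python
import os

# Placeholder value; the real PATH_DATA_DIR is defined by the library configuration
PATH_DATA_DIR = "data"


def crop_country_dir(crop, country_code, data_dir=PATH_DATA_DIR):
    """Build the directory in which CSV files for a crop and country are expected."""
    return os.path.join(data_dir, crop, country_code)


print(crop_country_dir("maize", "NL"))

# With the package installed and data in place, loading might then look like:
# df_y, dfs_x = load_dfs("maize", "NL")
```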

cybench.datasets.configured.load_dfs_crop(crop: str, countries: list = None) → dict

Load data for crop and one or more countries. If countries is None, data for all countries in CY-Bench is loaded.

Parameters:

crop (str) – crop name
countries (list) – list of 2-letter country codes

Returns:

a tuple (target DataFrame, dict of input DataFrames)

cybench.datasets.dataset module

class cybench.datasets.dataset.Dataset(crop, data_target: DataFrame = None, data_inputs: dict = None)

Bases: object

static _empty_df_target() → DataFrame

Helper function that creates an empty (but correctly formatted) dataframe for yield statistics

static _filter_df_on_index(df: DataFrame, keys: list, level: int)

Helper method for filtering a dataframe based on the occurrence of certain values in a specified index

Parameters:

df – the dataframe that should be filtered
keys – the values on which it should filter
level – the index level in which samples should be filtered

Returns:

a filtered dataframe
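Filtering on one level of a MultiIndex can be done with standard pandas calls, as in the sketch below. The function name and the sample data are illustrative, and the library's own helper may be implemented differently.

```python
import pandas as pd


def filter_on_index(df, keys, level):
    """Keep rows whose value at the given index level is one of keys."""
    mask = df.index.get_level_values(level).isin(keys)
    return df[mask]


idx = pd.MultiIndex.from_tuples(
    [("A", 2001), ("A", 2002), ("B", 2001)], names=["adm_id", "year"]
)
df = pd.DataFrame({"yield": [1.0, 2.0, 3.0]}, index=idx)

only_a = filter_on_index(df, keys=["A"], level=0)
only_2001 = filter_on_index(df, keys=[2001], level=1)
print(only_a)
print(only_2001)
```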

_get_feature_data(loc_id: str, year: int) → dict

Helper function for obtaining feature data corresponding to some index

Parameters:

loc_id – location index value
year – year index value

Returns:

a dict containing all feature data corresponding to the specified index

static _split_df_on_index(df: DataFrame, split: tuple, level: int)

static _validate_dfs(df_y: DataFrame, dfs_x: dict) → bool

Helper function that implements some checks on whether the input dataframes are correctly formatted

Parameters:

df_y – dataframe containing yield statistics
dfs_x – dict of data source to dataframes each containing feature data

Returns:

a bool indicating whether the test has passed

property crop

property feature_names: set

Obtain a set containing all feature names

get_normalization_params(normalization='standard')

Compute normalization parameters for input data.

Parameters:

normalization – normalization method, default standard or z-score

Returns:

a dict containing normalization parameters (e.g. mean and std)
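For standard (z-score) normalization, the parameters are a mean and a standard deviation per feature. The sketch below shows one way to compute and apply them with NumPy; the function name and the layout of the returned dict are assumptions, not the format returned by the method.

```python
import numpy as np


def standard_params(features):
    """Compute mean and (population) standard deviation for each feature."""
    return {
        name: {"mean": float(np.mean(values)), "std": float(np.std(values))}
        for name, values in features.items()
    }


# Made-up feature values
features = {"tavg": [10.0, 12.0, 14.0], "prec": [0.0, 2.0, 4.0]}
params = standard_params(features)

# Apply the z-score transform to one feature
tavg = np.asarray(features["tavg"])
tavg_norm = (tavg - params["tavg"]["mean"]) / params["tavg"]["std"]
print(params)
print(tavg_norm)
```

Parameters should be computed on training data only and then reused for validation and test data, so that no information leaks across the split.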

indices() → list

static load(dataset_name: str) → Dataset

property location_ids: set

Obtain a set containing all location ids occurring in the dataset

property max_season_window_length: int

split_on_years(years_split: tuple) → tuple

Create two new datasets based on the provided split in years

Parameters:

years_split – tuple, e.g. ([2012, 2014], [2013, 2015])

Returns:

two datasets
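What a split on years does to the underlying label data can be sketched with pandas alone. The helper below is hypothetical and operates on a DataFrame rather than on a Dataset; it assumes the year is the second index level.

```python
import pandas as pd


def split_by_years(df, years_split, level=1):
    """Return two frames holding the rows for each list of years."""
    first_years, second_years = years_split
    years = df.index.get_level_values(level)
    return df[years.isin(first_years)], df[years.isin(second_years)]


idx = pd.MultiIndex.from_product(
    [["A", "B"], [2012, 2013, 2014, 2015]], names=["adm_id", "year"]
)
df = pd.DataFrame({"yield": range(8)}, index=idx)

df_first, df_second = split_by_years(df, ([2012, 2014], [2013, 2015]))
print(sorted(set(df_first.index.get_level_values("year"))))
print(sorted(set(df_second.index.get_level_values("year"))))
```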

targets() → array

Obtain a numpy array of targets or labels

property years: set

Obtain a set containing all years occurring in the dataset

cybench.datasets.dataset_overview module

cybench.datasets.modified_dataset module

class cybench.datasets.modified_dataset.ModifiedTargetsDataset(dataset: Dataset, modified_targets: DataFrame = None)

Bases: Dataset

cybench.datasets.torch_dataset module

class cybench.datasets.torch_dataset.TorchDataset(dataset: Dataset, interpolate_time_series: bool = False, aggregate_time_series_to: str = None, max_season_window_length: int = None)

Bases: cybench.datasets.dataset.Dataset, torch.utils.data.Dataset

classmethod cast_to_tensor(sample: dict) → dict

Create a sample with all data cast to torch tensors

Parameters:

sample – the sample to convert

Returns:

the converted data sample

classmethod collate_fn(samples: list) → dict

Function that takes a list of data samples (as dicts, containing torch tensors) and converts it to a dict of batched torch tensors

Parameters:

samples – a list of data samples

Returns:

a dict with batched data

classmethod interpolate_and_aggregate(samples: list, max_season_window_length: int, aggregate_time_series_to: str = None)

Function that takes a list of data samples (as dicts, containing numpy arrays) and interpolates and (optionally) aggregates time series data

Parameters:

samples – a list of data samples
max_season_window_length – maximum length of time series
aggregate_time_series_to – resolution to aggregate time series to

Returns:

the same data samples after interpolation and aggregation
