cybench.models package

Submodules

cybench.models.model module

Model base class

class cybench.models.model.BaseModel

Bases: ABC

_abc_impl = <_abc._abc_data object>
abstract fit(dataset: Dataset, **fit_params) tuple

Fit or train the model.

Parameters:
  • dataset – Dataset

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

abstract load(model_name)

Deserialize a saved model.

Parameters:

model_name – Filename that was used to save the model.

Returns:

The deserialized model.

predict(dataset: Dataset, **predict_params) tuple

Run fitted model on data.

Parameters:
  • dataset – Dataset

  • **predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

abstract predict_items(X: list, **predict_params)

Run fitted model on a list of data items.

Parameters:
  • X – a list of data items, each of which is a dict

  • **predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

abstract save(model_name)

Save model, e.g. using pickle.

Parameters:

model_name – Filename that will be used to save the model.
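The interface above can be illustrated with a small stand-in. The sketch below does not import cybench: `SketchBaseModel`, `ConstantModel`, and the `"target"` key are hypothetical names, and plain lists stand in for `Dataset` and `np.ndarray`. It only shows the documented method names and the `(result, info dict)` return convention.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod


class SketchBaseModel(ABC):
    """Stand-in that mirrors the documented method names (not the real class)."""

    @abstractmethod
    def fit(self, dataset, **fit_params) -> tuple: ...

    @abstractmethod
    def predict_items(self, X: list, **predict_params): ...

    @abstractmethod
    def save(self, model_name): ...

    @abstractmethod
    def load(self, model_name): ...

    def predict(self, dataset, **predict_params) -> tuple:
        # predict is not abstract in the reference; here it simply delegates.
        return self.predict_items(list(dataset), **predict_params)


class ConstantModel(SketchBaseModel):
    """Hypothetical model that always predicts the mean training target."""

    def __init__(self):
        self.mean_ = None

    def fit(self, dataset, **fit_params) -> tuple:
        targets = [item["target"] for item in dataset]  # "target" key is assumed
        self.mean_ = sum(targets) / len(targets)
        return self, {}

    def predict_items(self, X: list, **predict_params):
        return [self.mean_ for _ in X], {}

    def save(self, model_name):
        # Only the learned state is pickled, which keeps the file small.
        with open(model_name, "wb") as f:
            pickle.dump({"mean": self.mean_}, f)

    def load(self, model_name):
        with open(model_name, "rb") as f:
            state = pickle.load(f)
        restored = ConstantModel()
        restored.mean_ = state["mean"]
        return restored


train = [{"target": 2.0}, {"target": 4.0}]
model, info = ConstantModel().fit(train)
preds, _ = model.predict(train)

# Round-trip through a file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "constant_model.pkl")
    model.save(path)
    restored = ConstantModel().load(path)
```

Returning `self` together with an info dict from `fit` follows the "fitted model and a dict with additional information" convention stated above.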

cybench.models.naive_models module

class cybench.models.naive_models.AverageYieldModel(group_by=['adm_id'])

Bases: BaseModel

A naive yield prediction model.

Predicts the average of the training set by location. If the location is not in the training data, then predicts the global average.

_abc_impl = <_abc._abc_data object>
fit(dataset: Dataset, **fit_params) tuple

Fit or train the model.

Parameters:
  • dataset – Dataset

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

load(model_name)

Deserialize a saved model.

Parameters:

model_name – Filename that was used to save the model.

Returns:

The deserialized model.

predict_items(X: list)

Run fitted model on a list of data items.

Parameters:

X – a list of data items, each of which is a dict

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name)

Save model, e.g. using pickle.

Parameters:

model_name – Filename that will be used to save the model.
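The fallback behavior described for `AverageYieldModel` (location average if the location was seen in training, global average otherwise) can be sketched in plain Python. This is an illustration, not the library's implementation: the function names and the `"yield"` key are hypothetical, and only the `adm_id` grouping key comes from the default `group_by` shown in the signature.

```python
from collections import defaultdict


def fit_location_averages(items, group_key="adm_id", target_key="yield"):
    """Return (per-location means, global mean) from a list of dict items."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for item in items:
        sums[item[group_key]] += item[target_key]
        counts[item[group_key]] += 1
    per_location = {loc: sums[loc] / counts[loc] for loc in sums}
    global_mean = sum(sums.values()) / sum(counts.values())
    return per_location, global_mean


def predict_location_average(item, per_location, global_mean, group_key="adm_id"):
    """Use the location mean if the location was seen, else the global mean."""
    return per_location.get(item[group_key], global_mean)


train_items = [
    {"adm_id": "A", "yield": 2.0},
    {"adm_id": "A", "yield": 4.0},
    {"adm_id": "B", "yield": 6.0},
]
per_location, global_mean = fit_location_averages(train_items)
seen = predict_location_average({"adm_id": "A"}, per_location, global_mean)
unseen = predict_location_average({"adm_id": "Z"}, per_location, global_mean)
```

Here `seen` is the average for location `A`, while `unseen` falls back to the average over all training items.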

cybench.models.nn_models module

class cybench.models.nn_models.BaseNNModel(**kwargs)

Bases: BaseModel, Module

_abc_impl = <_abc._abc_data object>
_forward_pass(batch: dict, device: str)

A forward pass for batched data.

Parameters:
  • batch (dict) – batched inputs

  • device (str) – the device to use

Returns:

An np.ndarray

_normalize_inputs(inputs)

Normalize inputs using saved normalization parameters.

Parameters:

inputs (dict) – unnormalized inputs

Returns:

The same dict after normalizing the entries
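The reference does not say which normalization is applied. As one possible reading, the sketch below standardizes selected entries with saved mean and standard deviation values and returns the same dict object. The function name, the `norm_params` layout, and the `"tavg"` key are assumptions made for illustration.

```python
def normalize_inputs(inputs, norm_params):
    """Standardize entries in place using saved (mean, std) pairs; return the same dict."""
    for key, (mean, std) in norm_params.items():
        if key in inputs:
            inputs[key] = [(value - mean) / std for value in inputs[key]]
    return inputs


# Hypothetical saved parameters and a small batch; non-numeric keys are left untouched.
norm_params = {"tavg": (10.0, 2.0)}
inputs = {"tavg": [8.0, 10.0, 14.0], "adm_id": ["A", "B", "C"]}
normalized = normalize_inputs(inputs, norm_params)
```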

_optimize_hyperparameters(dataset: Dataset, param_space: dict, optim_kwargs: dict, device: str = 'cpu', kfolds: int = 1, epochs: int = 10, **fit_params) dict

Optimize hyperparameters

Parameters:
  • dataset (Dataset) – training dataset

  • param_space (dict) – hyperparameters to optimize

  • optim_kwargs (dict) – arguments to the optimizer

  • device (str) – the device to use

  • kfolds (int) – k for k-fold cv (default: 1)

  • epochs (int) – Number of epochs to train (default: 10)

  • **fit_params – Additional parameters.

Returns:

A dict of optimal hyperparameter setting
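A `param_space` dict maps each hyperparameter name to a list or range of candidate values. The reference does not state how the space is searched; the sketch below assumes an exhaustive grid and only shows how candidate settings could be enumerated. The helper name is hypothetical, and `hidden_size` and `num_layers` are borrowed from the constructor signatures documented further down.

```python
from itertools import product


def iter_param_settings(param_space):
    """Yield one dict per combination of candidate values (exhaustive grid, assumed)."""
    names = list(param_space)
    for values in product(*(param_space[name] for name in names)):
        yield dict(zip(names, values))


param_space = {"hidden_size": [32, 64], "num_layers": range(1, 4)}
settings = list(iter_param_settings(param_space))
```

Each yielded dict could then be used to build and score one candidate model, with the best-scoring dict returned as the optimal setting.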

_train_and_validate(dataset: ~cybench.datasets.dataset.Dataset, train_years: list, val_years: list, validation_interval: int = 5, epochs: int = 10, batch_size: int = 16, optimizer_fn: callable = <class 'torch.optim.adam.Adam'>, optim_kwargs: dict = {}, loss_fn: callable = <function mse_loss>, loss_kwargs: dict = {}, scheduler_fn: callable | None = None, sched_kwargs: dict = {}, device: str = 'cpu', **kwargs)

Fit or train the model and evaluate on validation data.

Parameters:
  • dataset (Dataset) – training dataset

  • train_years (list) – training years

  • val_years (list) – validation years

  • validation_interval (int) – validation frequency (default: 5)

  • epochs (int) – the number of epochs to train the model (default: 10)

  • batch_size (int) – the batch size (default: 16)

  • optimizer_fn (callable) – the optimizer function (default: Adam)

  • optim_kwargs (dict) – arguments to the optimizer function

  • loss_fn (callable) – the loss function (default: mse_loss)

  • loss_kwargs (dict) – arguments to the loss function

  • scheduler_fn (callable) – the scheduler function (default: None)

  • sched_kwargs (dict) – arguments to the scheduler function

  • device (str) – the device to use

  • **kwargs – Additional parameters.

Returns:

A tuple of training losses, validation losses, and the maximum number of epochs to train.

_train_epoch(tqdm_loader: ~tqdm.std.tqdm, device: str, optimizer: ~torch.optim.optimizer.Optimizer, loss_fn: callable = <function mse_loss>, loss_kwargs: dict = {'reduction': 'mean'}, scheduler: ~torch.optim.lr_scheduler.LRScheduler | None = None)

Run one epoch during training

Parameters:
  • tqdm_loader (tqdm) – data loader with progress bar

  • device (str) – the device to use

  • optimizer (torch.optim.Optimizer) – the optimizer

  • loss_fn (callable) – the loss function, default mse_loss

  • loss_kwargs (dict) – the arguments to loss_fn

  • scheduler (torch.optim.lr_scheduler.LRScheduler) – scheduler for learning rate of optimizer

Returns:

The average of all batch losses

_train_final_model(dataset: ~cybench.datasets.dataset.Dataset, epochs: int, optimizer_fn: callable = <class 'torch.optim.adam.Adam'>, optim_kwargs: dict = {}, loss_fn: callable = <function mse_loss>, loss_kwargs: dict = {'reduction': 'mean'}, scheduler_fn: callable | None = None, sched_kwargs: dict = {}, device: str = 'cpu', batch_size: int = 16, **kwargs)

Fit or train the model on the entire training set.

Parameters:
  • dataset (Dataset) – training dataset

  • epochs (int) – number of epochs to train

  • optimizer_fn (callable) – the optimizer function (default: Adam)

  • optim_kwargs (dict) – arguments to the optimizer function

  • loss_fn (callable) – the loss function (default: mse_loss)

  • loss_kwargs (dict) – arguments to the loss function

  • scheduler_fn (callable) – the scheduler function (default: None)

  • sched_kwargs (dict) – arguments to the scheduler function

  • device (str) – the device to use

  • batch_size (int) – default is 16

  • **kwargs – Additional parameters.

Returns:

A list of training losses (one value per epoch).

fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict | None = None, optim_kwargs: dict = {}, device: str = 'cpu', seed: int = 42, **fit_params)

Fit or train the model.

Parameters:
  • dataset (Dataset) – training dataset.

  • optimize_hyperparameters (bool) – whether to tune hyperparameters

  • param_space (dict) – each entry is a hyperparameter name and list or range of values

  • optim_kwargs (dict) – arguments to the optimizer

  • device (str) – the device to use.

  • seed (int) – seed for random number generator

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

classmethod load(model_name)

Load model using torch.load.

Parameters:

model_name (str) – Filename that was used to save the model.

Returns:

The loaded model.

predict(dataset: Dataset, device: str = 'cpu', batch_size: int = 16, **predict_params)

Run fitted model on batched data items.

Parameters:
  • dataset (Dataset) – validation dataset

  • device (str) – the device to use

  • batch_size (int) – the batch size (default: 16)

  • **predict_params – Additional parameters

Returns:

A tuple containing a np.ndarray and a dict with additional information.

predict_items(X: list, device: str = 'cpu', **predict_params)

Run fitted model on a list of data items.

Parameters:
  • X (list) – a list of data items, each of which is a dict

  • device (str) – the device to use

  • **predict_params – Additional parameters

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name)

Save model using torch.save.

Parameters:

model_name (str) – Filename that will be used to save the model.

class cybench.models.nn_models.BaselineInceptionTime(time_series_have_same_length=False, hidden_size=64, num_layers=6, num_features=32, output_size=1, **kwargs)

Bases: BaseNNModel

InceptionTime model.

Parameters:
  • time_series_have_same_length (bool) – whether time series have the same length

  • hidden_size (int) – The number of features InceptionTime outputs

  • num_layers (int) – The number of InceptionBlocks. Defaults to 6.

  • num_features (int) – The number of features within the InceptionBlocks. Defaults to 32.

  • output_size (int) – The number of output classes. Defaults to 1.

  • **kwargs – Additional keyword arguments passed to the base class.

_abc_impl = <_abc._abc_data object>
fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = {}, kfolds: int = 1, epochs: int = 10, device: str = 'cpu', seed: int = 42, **fit_params)

Fit or train the model.

Parameters:
  • dataset (Dataset) – Training dataset.

  • optimize_hyperparameters (bool) – Flag to tune hyperparameters.

  • param_space (dict) – Each entry is a hyperparameter name and list or range of values.

  • kfolds (int) – k in k-fold cv.

  • epochs (int) – Number of epochs to train.

  • device (str) – the device to use.

  • seed (int) – seed for random number generator.

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cybench.models.nn_models.BaselineLSTM(time_series_have_same_length=False, hidden_size=64, num_layers=1, output_size=1, **kwargs)

Bases: BaseNNModel

LSTM model.

Parameters:
  • time_series_have_same_length (bool) – whether time series have the same length

  • hidden_size (int) – The number of features the LSTM outputs.

  • num_layers (int) – The number of LSTM layers. Defaults to 1.

  • output_size (int) – The number of output classes. Defaults to 1.

  • **kwargs – Additional keyword arguments passed to the base class.

_abc_impl = <_abc._abc_data object>
fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = {}, kfolds: int = 1, epochs: int = 10, device: str = 'cpu', seed: int = 42, **fit_params)

Fit or train the model.

Parameters:
  • dataset (Dataset) – Training dataset.

  • optimize_hyperparameters (bool) – Flag to tune hyperparameters.

  • param_space (dict) – Each entry is a hyperparameter name and list or range of values.

  • kfolds (int) – k in k-fold cv.

  • epochs (int) – Number of epochs to train.

  • device (str) – the device to use.

  • seed (int) – seed for random number generator.

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cybench.models.nn_models.BaselineTransformer(seq_len, time_series_have_same_length=False, hidden_size=64, d_model=64, n_head=1, d_ff=256, output_size=1, num_layers=3, **kwargs)

Bases: BaseNNModel

Transformer model.

Parameters:
  • seq_len (int) – length of time series sequence (in days)

  • time_series_have_same_length (bool) – whether time series have the same length

  • hidden_size (int) – The number of resulting timeseries features.

  • d_model (int) – Total dimension of the model.

  • n_head (int) – Parallel attention heads.

  • d_ff (int) – The dimension of the feedforward network model.

  • output_size (int) – The number of output classes. Defaults to 1.

  • num_layers (int) – The number of sub-encoder-layers in the encoder. Defaults to 3.

  • **kwargs – Additional keyword arguments passed to the base class.

_abc_impl = <_abc._abc_data object>
fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = {}, kfolds: int = 1, epochs: int = 10, device: str = 'cpu', seed: int = 42, **fit_params)

Fit or train the model.

Parameters:
  • dataset (Dataset) – Training dataset.

  • optimize_hyperparameters (bool) – Flag to tune hyperparameters.

  • param_space (dict) – Each entry is a hyperparameter name and list or range of values.

  • kfolds (int) – k in k-fold cv.

  • epochs (int) – Number of epochs to train.

  • device (str) – the device to use.

  • seed (int) – seed for random number generator.

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cybench.models.nn_models.separate_ts_static_inputs(batch: dict) tuple

Stack time series and static inputs separately.

Parameters:

batch (dict) – batched inputs

Returns:

A tuple of torch tensors for time series and static inputs

cybench.models.residual_models module

class cybench.models.residual_models.InceptionTimeRes(**kwargs)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>
class cybench.models.residual_models.LSTMRes(**kwargs)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>
class cybench.models.residual_models.RandomForestRes(feature_cols: list | None = None)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>
class cybench.models.residual_models.ResidualModel(baseline_model: BaseModel)

Bases: BaseModel

_abc_impl = <_abc._abc_data object>
fit(dataset: Dataset, **fit_params)

Fit or train the model.

Parameters:
  • dataset (Dataset) – training dataset

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

load(model_name: str)

Deserialize a saved model.

Parameters:

model_name (str) – Filename that was used to save the model.

Returns:

The deserialized model.

predict(dataset: Dataset, **predict_params)

Run fitted model on batched data items.

Parameters:
  • dataset (Dataset) – test dataset

  • **predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

predict_items(X: list, crop=None, **predict_params)

Run fitted model on a list of data items.

Parameters:
  • X (list) – a list of data items, each of which is a dict

  • crop (str) – crop name

  • **predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name: str)

Save model, e.g. using pickle. Check here for options to save and load scikit-learn models: https://scikit-learn.org/stable/model_persistence.html

Parameters:

model_name (str) – Filename that will be used to save the model.
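`ResidualModel` carries no description in this reference. Judging only by its name and its `baseline_model` argument, it presumably combines a baseline prediction with a second model fitted on what the baseline misses. The sketch below shows that general idea under stated assumptions: a constant baseline and a per-location mean residual, with hypothetical function names. It is not the library's implementation.

```python
def fit_residual_sketch(targets, locations):
    """Fit a constant baseline, then a per-location mean of the baseline's errors."""
    baseline = sum(targets) / len(targets)
    residuals = [y - baseline for y in targets]
    by_location = {}
    for loc, res in zip(locations, residuals):
        by_location.setdefault(loc, []).append(res)
    residual_model = {loc: sum(rs) / len(rs) for loc, rs in by_location.items()}
    return baseline, residual_model


def predict_residual_sketch(baseline, residual_model, location):
    """Final prediction = baseline prediction + predicted residual (0.0 if unseen)."""
    return baseline + residual_model.get(location, 0.0)


baseline, residual_model = fit_residual_sketch(
    [2.0, 4.0, 6.0, 8.0], ["A", "A", "B", "B"]
)
pred_a = predict_residual_sketch(baseline, residual_model, "A")
pred_new = predict_residual_sketch(baseline, residual_model, "C")
```

In this toy case the baseline alone would predict the overall mean everywhere; the residual step shifts the prediction for location `A` toward that location's own values.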

class cybench.models.residual_models.RidgeRes(feature_cols: list | None = None)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>
class cybench.models.residual_models.TransformerRes(**kwargs)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>

cybench.models.sklearn_models module

class cybench.models.sklearn_models.BaseSklearnModel(**kwargs)

Bases: BaseModel

Base class for wrappers around scikit-learn estimators.

_abc_impl = <_abc._abc_data object>
_design_features(crop: str, data_items: Iterable)

Design features using data samples.

Parameters:
  • crop (str) – crop name

  • data_items (Iterable) – a Dataset or list of data items.

Returns:

A pandas dataframe with KEY_LOC, KEY_YEAR and features.

_optimize_hyperparameters(X: ndarray, y: ndarray, param_space: dict, groups: ndarray | None = None, kfolds=5)

Optimize hyperparameters

Parameters:
  • X (np.ndarray) – training features

  • y (np.ndarray) – training labels

  • param_space (dict) – hyperparameters to optimize

  • groups (np.ndarray) – group values (e.g. year values) for each row in X and y

  • kfolds (int) – number of splits in cross validation

Returns:

A sklearn pipeline refitted with the optimal hyperparameters.
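The `groups` and `kfolds` parameters suggest grouped cross-validation, in which all rows sharing a group value (for example a year) stay on the same side of each split. The reference does not name the splitter used, so the following is a plain-Python sketch of the idea with a hypothetical helper rather than the library's code.

```python
def grouped_kfold_indices(groups, kfolds):
    """Yield (train_idx, val_idx) pairs in which no group is split across both sides."""
    unique_groups = sorted(set(groups))
    for fold in range(kfolds):
        # Every kfolds-th group, starting at this fold's offset, is held out.
        val_groups = set(unique_groups[fold::kfolds])
        val_idx = [i for i, g in enumerate(groups) if g in val_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
        yield train_idx, val_idx


years = [2001, 2001, 2002, 2002, 2003, 2003, 2004]
folds = list(grouped_kfold_indices(years, kfolds=2))
```

Keeping whole years together avoids scoring a model on rows from a year it has already seen during training.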

_predict(crop: str, data_items: Iterable)

Utility method called by both predict_items and predict.

Parameters:
  • crop (str) – crop name

  • data_items (Iterable) – a Dataset or a list of data items

Returns:

A tuple containing a np.ndarray and a dict with additional information.

fit(dataset: Dataset, optimize_hyperparameters=False, select_features=False, **fit_params) tuple

Fit or train the model.

Parameters:
  • dataset (Dataset) – training dataset

  • optimize_hyperparameters (bool) – flag to optimize hyperparameters

  • select_features (bool) – flag to select features

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

load(model_name: str)

Deserialize a saved model.

Parameters:

model_name (str) – Filename that was used to save the model.

Returns:

The deserialized model.

predict(dataset: Dataset, **predict_params)

Run fitted model on batched data items.

Parameters:
  • dataset (Dataset) – test dataset

  • **predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

predict_items(X: list, crop=None, **predict_params)

Run fitted model on a list of data items.

Parameters:
  • X (list) – a list of data items, each of which is a dict

  • crop (str) – crop name

  • **predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name: str)

Save model, e.g. using pickle. Check here for options to save and load scikit-learn models: https://scikit-learn.org/stable/model_persistence.html

Parameters:

model_name (str) – Filename that will be used to save the model.

class cybench.models.sklearn_models.SklearnRandomForest(feature_cols: list | None = None)

Bases: BaseSklearnModel

_abc_impl = <_abc._abc_data object>
fit(train_dataset: Dataset, **fit_params)

Fit or train the model.

Parameters:
  • train_dataset (Dataset) – training dataset

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

class cybench.models.sklearn_models.SklearnRidge(feature_cols: list | None = None)

Bases: BaseSklearnModel

_abc_impl = <_abc._abc_data object>
fit(train_dataset: Dataset, **fit_params)

Fit or train the model.

Parameters:
  • train_dataset (Dataset) – training dataset

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

cybench.models.trend_models module

class cybench.models.trend_models.TrendModel

Bases: BaseModel

Default trend estimator.

Trend is estimated using years as features.

MAX_TREND_WINDOW_SIZE = 10
MIN_TREND_WINDOW_SIZE = 5
_abc_impl = <_abc._abc_data object>
_estimate_trend(trend_x: list, trend_y: list, test_x: int)

Implements a linear trend. From @mmeronijrc: Small sample sizes and the use of quadratic or loess trend can lead to strange results.

Parameters:
  • trend_x (list) – year in the trend window.

  • trend_y (list) – values (e.g. yields) in the trend window

  • test_x (int) – test year

Returns:

estimated trend (float)
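A linear trend over a window of years can be estimated by ordinary least squares and then evaluated at the test year. The sketch below uses only the standard library; the function name and the sample values are illustrative, and the actual method may rely on a different fitting routine.

```python
def estimate_linear_trend(trend_x, trend_y, test_x):
    """Fit y = slope * x + intercept by least squares and evaluate it at test_x."""
    n = len(trend_x)
    mean_x = sum(trend_x) / n
    mean_y = sum(trend_y) / n
    sxx = sum((x - mean_x) ** 2 for x in trend_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(trend_x, trend_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * test_x + intercept


# Yields rise by 0.5 per year, so the trend continues that line into 2020.
years = [2015, 2016, 2017, 2018, 2019]
yields = [3.0, 3.5, 4.0, 4.5, 5.0]
trend_2020 = estimate_linear_trend(years, yields, 2020)
```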

_find_optimal_trend_window(train_labels: ndarray, window_years: list, extend_forward: bool = False)

Find the optimal trend window based on pymannkendall statistical test.

Parameters:
  • train_labels (np.ndarray) – years and values for a specific location.

  • window_years (list) – years to consider in a window

  • extend_forward (bool) – if true, extend trend window forward, else backward

Returns:

a list of years representing the optimal trend window
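The Mann-Kendall test behind pymannkendall checks for a monotonic trend by counting how many later values exceed earlier ones. The sketch below reimplements the basic test (normal approximation, no tie correction) with the standard library and pairs it with an assumed selection rule: among the most recent windows of 5 to 10 years (the class constants above), keep the one with the smallest p-value below a hypothetical `alpha`. The selection rule, the function names, and `alpha` are assumptions, and `extend_forward` is not modeled.

```python
import math


def mann_kendall_p_value(values):
    """Two-sided Mann-Kendall p-value (normal approximation, no tie correction)."""
    n = len(values)
    # S counts concordant minus discordant pairs (later value vs. earlier value).
    s = sum(
        (values[j] > values[i]) - (values[j] < values[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def pick_trend_window(years, values, min_size=5, max_size=10, alpha=0.05):
    """Return the years of the most significant recent window, or None if none qualifies."""
    best = None
    for size in range(min_size, min(max_size, len(years)) + 1):
        p = mann_kendall_p_value(values[-size:])
        if p < alpha and (best is None or p < best[0]):
            best = (p, years[-size:])
    return None if best is None else best[1]


years = list(range(2010, 2020))
increasing = [float(i) for i in range(10)]
flat = [1.0] * 10
window_increasing = pick_trend_window(years, increasing)
window_flat = pick_trend_window(years, flat)
```

With a steadily increasing series the longest window gives the strongest evidence of a trend, while a flat series yields no qualifying window.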

_predict_trend(test_data: Iterable)

Predict the trend for each data item in test data.

Parameters:

test_data (Iterable) – Dataset or a list of data items

Returns:

np.ndarray of predictions

fit(dataset: Dataset, **fit_params) tuple

Fit or train the model.

Parameters:
  • dataset – Dataset

  • **fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

load(model_name)

Deserialize a saved model.

Parameters:

model_name – Filename that was used to save the model.

Returns:

The deserialized model.

predict(dataset: Dataset, **predict_params)

Run fitted model on a test dataset.

Parameters:
  • dataset – Dataset

  • **predict_params – Additional parameters

Returns:

A tuple containing a np.ndarray and a dict with additional information.

predict_items(X: list, **predict_params)

Run fitted model on a list of data items.

Parameters:
  • X – a list of data items, each of which is a dict

  • **predict_params – Additional parameters

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name)

Save model, e.g. using pickle.

Parameters:

model_name – Filename that will be used to save the model.

Module contents