cybench.models package

Submodules

cybench.models.model module

Model base class

class cybench.models.model.BaseModel

Bases: ABC

_abc_impl = <_abc._abc_data object>

abstract fit(dataset: Dataset, **fit_params) → tuple

Fit or train the model.

Parameters:

dataset – Dataset
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

abstract load(model_name)

Deserialize a saved model.

Parameters:

model_name – Filename that was used to save the model.

Returns:

The deserialized model.

predict(dataset: Dataset, **predict_params) → tuple

Run fitted model on data.

Parameters:

dataset – Dataset
**predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

abstract predict_items(X: list, **predict_params)

Run fitted model on a list of data items.

Parameters:

X – a list of data items, each of which is a dict
**predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

abstract save(model_name)

Save model, e.g. using pickle.

Parameters:

model_name – Filename that will be used to save the model.
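The abstract interface above can be illustrated with a minimal, self-contained sketch. It does not import cybench: `SketchBaseModel` is a stand-in that only mirrors the documented method names, and `ConstantModel`, the `"yield"` key, the delegation from `predict` to `predict_items`, and the choice to make `load` a classmethod are all assumptions made for illustration. Predictions are returned as a plain list here rather than the np.ndarray the real interface documents.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod


class SketchBaseModel(ABC):
    """Stand-in mirroring the documented BaseModel methods (not the real class)."""

    @abstractmethod
    def fit(self, dataset, **fit_params) -> tuple: ...

    @abstractmethod
    def predict_items(self, X: list, **predict_params) -> tuple: ...

    @abstractmethod
    def save(self, model_name): ...

    @classmethod
    @abstractmethod
    def load(cls, model_name): ...

    def predict(self, dataset, **predict_params) -> tuple:
        # Assumption: predict delegates to predict_items on the dataset's items.
        return self.predict_items(list(dataset), **predict_params)


class ConstantModel(SketchBaseModel):
    """Hypothetical model that always predicts the mean training target."""

    def __init__(self, target_key="yield"):  # "yield" is an assumed key name
        self.target_key = target_key
        self.mean_ = None

    def fit(self, dataset, **fit_params) -> tuple:
        values = [item[self.target_key] for item in dataset]
        self.mean_ = sum(values) / len(values)
        return self, {}  # fitted model plus a dict with additional information

    def predict_items(self, X: list, **predict_params) -> tuple:
        return [self.mean_ for _ in X], {}

    def save(self, model_name):
        # Only the fitted state is pickled, not the instance itself.
        state = {"target_key": self.target_key, "mean_": self.mean_}
        with open(model_name, "wb") as f:
            pickle.dump(state, f)

    @classmethod
    def load(cls, model_name):
        with open(model_name, "rb") as f:
            state = pickle.load(f)
        model = cls(target_key=state["target_key"])
        model.mean_ = state["mean_"]
        return model


train_items = [{"yield": 2.0}, {"yield": 4.0}]
model, info = ConstantModel().fit(train_items)
preds, _ = model.predict([{"yield": 0.0}])

# Round-trip through a temporary file to exercise save and load.
path = os.path.join(tempfile.mkdtemp(), "constant_model.pkl")
model.save(path)
restored = ConstantModel.load(path)
```

Pickling only the state dict keeps the round trip independent of where the class is defined. Real subclasses may serialize differently.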

cybench.models.naive_models module

class cybench.models.naive_models.AverageYieldModel(group_by=['adm_id'])

Bases: BaseModel

A naive yield prediction model.

Predicts the average of the training set by location. If the location is not in the training data, then predicts the global average.

_abc_impl = <_abc._abc_data object>

fit(dataset: Dataset, **fit_params) → tuple

Fit or train the model.

Parameters:

dataset – Dataset
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

load(model_name)

Deserialize a saved model.

Parameters:

model_name – Filename that was used to save the model.

Returns:

The deserialized model.

predict_items(X: list)

Run fitted model on a list of data items.

Parameters:

X – a list of data items, each of which is a dict

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name)

Save model, e.g. using pickle.

Parameters:

model_name – Filename that will be used to save the model.
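The behaviour described for AverageYieldModel (per-location average with a global-average fallback) can be sketched in plain Python. This is not the library's implementation: the function names and the `"yield"` target key are hypothetical, and only the `adm_id` grouping default is taken from the class signature.

```python
from collections import defaultdict


def fit_group_means(items, group_by=("adm_id",), target_key="yield"):
    """Return (per-group means, global mean) over a list of dict items."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for item in items:
        key = tuple(item[col] for col in group_by)
        sums[key] += item[target_key]
        counts[key] += 1
    group_means = {key: sums[key] / counts[key] for key in sums}
    global_mean = sum(sums.values()) / sum(counts.values())
    return group_means, global_mean


def predict_group_means(items, group_means, global_mean, group_by=("adm_id",)):
    """Predict the group mean, falling back to the global mean for unseen groups."""
    return [
        group_means.get(tuple(item[col] for col in group_by), global_mean)
        for item in items
    ]


train = [
    {"adm_id": "A", "yield": 2.0},
    {"adm_id": "A", "yield": 4.0},
    {"adm_id": "B", "yield": 6.0},
]
means, overall = fit_group_means(train)
# "C" does not occur in the training items, so it receives the global mean.
predictions = predict_group_means([{"adm_id": "A"}, {"adm_id": "C"}], means, overall)
```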

cybench.models.nn_models module

class cybench.models.nn_models.BaseNNModel(**kwargs)

Bases: BaseModel, Module

_abc_impl = <_abc._abc_data object>

_forward_pass(batch: dict, device: str)

A forward pass for batched data.

Parameters:

batch (dict) – batched inputs
device (str) – the device to use

Returns:

An np.ndarray

_normalize_inputs(inputs)

Normalize inputs using saved normalization parameters.

Parameters:

inputs (dict) – unnormalized inputs

Returns:

The same dict after normalizing the entries

_optimize_hyperparameters(dataset: Dataset, param_space: dict, optim_kwargs: dict, device: str = 'cpu', kfolds: int = 1, epochs: int = 10, **fit_params) → dict

Optimize hyperparameters

Parameters:

dataset (Dataset) – training dataset
param_space (dict) – hyperparameters to optimize
optim_kwargs (dict) – arguments to the optimizer
device (str) – the device to use
kfolds (int) – k for k-fold cv (default: 1)
epochs (int) – Number of epochs to train (default: 10)
**fit_params – Additional parameters.

Returns:

A dict of the optimal hyperparameter settings.

_train_and_validate(dataset: ~cybench.datasets.dataset.Dataset, train_years: list, val_years: list, validation_interval: int = 5, epochs: int = 10, batch_size: int = 16, optimizer_fn: callable = <class 'torch.optim.adam.Adam'>, optim_kwargs: dict = {}, loss_fn: callable = <function mse_loss>, loss_kwargs: dict = {}, scheduler_fn: callable = None, sched_kwargs: dict = {}, device: str = 'cpu', **kwargs)

Fit or train the model and evaluate on validation data.

Parameters:

dataset (Dataset) – training dataset
train_years (list) – training years
val_years (list) – validation years
validation_interval (int) – validation frequency (default: 5)
epochs (int) – the number of epochs to train the model (default: 10)
batch_size (int) – the batch size (default: 16)
optimizer_fn (callable) – the optimizer function (default: Adam)
optim_kwargs (dict) – arguments to the optimizer function
loss_fn (callable) – the loss function (default: mse_loss)
loss_kwargs (dict) – arguments to the loss function
scheduler_fn (callable) – the scheduler function (default: None)
sched_kwargs (dict) – arguments to the scheduler function
device (str) – the device to use
**kwargs – Additional parameters.

Returns:

A tuple of training losses, validation losses and the maximum number of epochs to train.

_train_epoch(pbar: ~tqdm.std.tqdm, dataloader: ~torch.utils.data.dataloader.DataLoader, device: str, optimizer: ~torch.optim.optimizer.Optimizer, loss_fn: callable = <function mse_loss>, loss_kwargs: dict = {'reduction': 'mean'}, scheduler: ~torch.optim.lr_scheduler.LRScheduler = None)

Run one epoch during training

Parameters:

pbar (tqdm) – tqdm progress bar
dataloader (torch.utils.data.DataLoader) – data loader with progress bar
device (str) – the device to use
optimizer (torch.optim.Optimizer) – the optimizer
loss_fn (callable) – the loss function, default mse_loss
loss_kwargs (dict) – the arguments to loss_fn
scheduler (torch.optim.lr_scheduler.LRScheduler) – scheduler for learning rate of optimizer

Returns:

The average of all batch losses

_train_final_model(dataset: ~cybench.datasets.dataset.Dataset, epochs: int, optimizer_fn: callable = <class 'torch.optim.adam.Adam'>, optim_kwargs: dict = {}, loss_fn: callable = <function mse_loss>, loss_kwargs: dict = {'reduction': 'mean'}, scheduler_fn: callable = None, sched_kwargs: dict = {}, device: str = 'cpu', batch_size: int = 16, **kwargs)

Fit or train the model on the entire training set.

Parameters:

dataset (Dataset) – training dataset
epochs (int) – number of epochs to train
optimizer_fn (callable) – the optimizer function (default: Adam)
optim_kwargs (dict) – arguments to the optimizer function
loss_fn (callable) – the loss function (default: mse_loss)
loss_kwargs (dict) – arguments to the loss function
scheduler_fn (callable) – the scheduler function (default: None)
sched_kwargs (dict) – arguments to the scheduler function
device (str) – the device to use
batch_size (int) – default is 16
**kwargs – Additional parameters.

Returns:

A list of training losses (one value per epoch).

fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = None, optim_kwargs: dict = {}, device: str = 'cpu', seed: int = 42, **fit_params)

Fit or train the model.

Parameters:

dataset (Dataset) – training dataset.
optimize_hyperparameters (bool) – whether to tune hyperparameters
param_space (dict) – each entry is a hyperparameter name and list or range of values
optim_kwargs (dict) – arguments to the optimizer
device (str) – the device to use.
seed (int) – seed for random number generator
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.
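As an illustration of the `param_space` format (a hyperparameter name mapped to a list or range of values), the sketch below enumerates every combination. The helper name and the `lr` and `weight_decay` keys are hypothetical. Whether the package searches the space exhaustively or by some other strategy is not stated here, so this only shows one way such a dict could be expanded.

```python
from itertools import product


def expand_param_space(param_space: dict) -> list:
    """Enumerate every combination of the listed hyperparameter values."""
    names = list(param_space)
    value_lists = [list(param_space[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical hyperparameter names; values may be lists or ranges.
param_space = {"lr": [1e-3, 1e-4], "weight_decay": range(0, 2)}
settings = expand_param_space(param_space)
```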

classmethod load(model_name)

Load model using torch.load.

Parameters:

model_name (str) – Filename that was used to save the model.

Returns:

The loaded model.

predict(dataset: Dataset, device: str = 'cpu', batch_size: int = 16, **predict_params)

Run fitted model on batched data items.

Parameters:

dataset (Dataset) – validation dataset
device (str) – the device to use
batch_size (int) – the batch size (default: 16)
**predict_params – Additional parameters

Returns:

A tuple containing a np.ndarray and a dict with additional information.

predict_items(X: list, device: str = 'cpu', **predict_params)

Run fitted model on a list of data items.

Parameters:

X (list) – a list of data items, each of which is a dict
device (str) – str, the device to use
**predict_params – Additional parameters

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name)

Save model using torch.save.

Parameters:

model_name (str) – Filename that will be used to save the model.

class cybench.models.nn_models.BaselineInceptionTime(time_series_have_same_length=False, hidden_size=64, num_layers=6, num_features=32, output_size=1, **kwargs)

Bases: BaseNNModel

InceptionTime model.

Parameters:

time_series_have_same_length (bool) – whether time series have the same length
hidden_size (int) – The number of features InceptionTime outputs
num_layers (int) – The number of InceptionBlocks. Defaults to 6.
num_features (int) – The number of features within the InceptionBlocks. Defaults to 32.
output_size (int) – The number of output classes. Defaults to 1.
**kwargs – Additional keyword arguments passed to the base class.

_abc_impl = <_abc._abc_data object>

fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = {}, kfolds: int = 1, epochs: int = 10, device: str = 'cpu', seed: int = 42, **fit_params)

Fit or train the model.

Parameters:

dataset (Dataset) – Training dataset.
optimize_hyperparameters (bool) – Flag to tune hyperparameters.
param_space (dict) – Each entry is a hyperparameter name and list or range of values.
kfolds (int) – k in k-fold cv.
epochs (int) – Number of epochs to train.
device (str) – The device to use.
seed (int) – Seed for random number generator.
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cybench.models.nn_models.BaselineLSTM(time_series_have_same_length=False, hidden_size=64, num_layers=1, output_size=1, **kwargs)

Bases: BaseNNModel

LSTM model.

Parameters:

time_series_have_same_length (bool) – whether time series have the same length
hidden_size (int) – The size of the LSTM hidden state.
num_layers (int) – The number of LSTM layers. Defaults to 1.
output_size (int) – The number of output classes. Defaults to 1.
**kwargs – Additional keyword arguments passed to the base class.

_abc_impl = <_abc._abc_data object>

fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = {}, kfolds: int = 1, epochs: int = 10, device: str = 'cpu', seed: int = 42, **fit_params)

Fit or train the model.

Parameters:

dataset (Dataset) – Training dataset.
optimize_hyperparameters (bool) – Flag to tune hyperparameters.
param_space (dict) – Each entry is a hyperparameter name and list or range of values.
kfolds (int) – k in k-fold cv.
epochs (int) – Number of epochs to train.
device (str) – The device to use.
seed (int) – Seed for random number generator.
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class cybench.models.nn_models.BaselineTransformer(seq_len, time_series_have_same_length=False, hidden_size=64, d_model=64, n_head=1, d_ff=256, output_size=1, num_layers=3, **kwargs)

Bases: BaseNNModel

Transformer model.

Parameters:

seq_len (int) – length of time series sequence (in days)
time_series_have_same_length (bool) – whether time series have the same length
hidden_size (int) – The number of resulting timeseries features.
d_model (int) – Total dimension of the model.
n_head (int) – Parallel attention heads.
d_ff (int) – The dimension of the feedforward network model.
output_size (int) – The number of output classes. Defaults to 1.
num_layers (int) – The number of sub-encoder-layers in the encoder.
**kwargs – Additional keyword arguments passed to the base class.

_abc_impl = <_abc._abc_data object>

fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = {}, kfolds: int = 1, epochs: int = 10, device: str = 'cpu', seed: int = 42, **fit_params)

Fit or train the model.

Parameters:

dataset (Dataset) – Training dataset.
optimize_hyperparameters (bool) – Flag to tune hyperparameters.
param_space (dict) – Each entry is a hyperparameter name and list or range of values.
kfolds (int) – k in k-fold cv.
epochs (int) – Number of epochs to train.
device (str) – The device to use.
seed (int) – Seed for random number generator.
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

cybench.models.nn_models.separate_ts_static_inputs(batch: dict) → tuple

Stack time series and static inputs separately.

Parameters:

batch (dict) – batched inputs

Returns:

A tuple of torch tensors for time series and static inputs

cybench.models.residual_models module

class cybench.models.residual_models.InceptionTimeRes(**kwargs)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>

class cybench.models.residual_models.LSTMRes(**kwargs)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>

class cybench.models.residual_models.RandomForestRes(feature_cols: list = None)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>

class cybench.models.residual_models.ResidualModel(baseline_model: BaseModel)

Bases: BaseModel

_abc_impl = <_abc._abc_data object>

fit(dataset: Dataset, **fit_params)

Fit or train the model.

Parameters:

dataset (Dataset) – training dataset
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

load(model_name: str)

Deserialize a saved model.

Parameters:

model_name (str) – Filename that was used to save the model.

Returns:

The deserialized model.

predict(dataset: Dataset, **predict_params)

Run fitted model on batched data items.

Parameters:

dataset (Dataset) – test dataset
**predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

predict_items(X: list, crop=None, **predict_params)

Run fitted model on a list of data items.

Parameters:

X (list) – a list of data items, each of which is a dict
crop (str) – crop name
**predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name: str)

Save model, e.g. using pickle. Check here for options to save and load scikit-learn models: https://scikit-learn.org/stable/model_persistence.html

Parameters:

model_name (str) – Filename that will be used to save the model.

class cybench.models.residual_models.RidgeRes(feature_cols: list = None)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>

class cybench.models.residual_models.TransformerRes(**kwargs)

Bases: ResidualModel

_abc_impl = <_abc._abc_data object>

cybench.models.sklearn_models module

class cybench.models.sklearn_models.BaseSklearnModel(**kwargs)

Bases: BaseModel

Base class for wrappers around scikit learn estimators.

_abc_impl = <_abc._abc_data object>

_design_features(crop: str, data_items: Iterable)

Design features using data samples.

Parameters:

crop (str) – crop name
data_items (Iterable) – a Dataset or list of data items.

Returns:

A pandas dataframe with KEY_LOC, KEY_YEAR and features.

_optimize_hyperparameters(X: ndarray, y: ndarray, param_space: dict, groups: ndarray = None, kfolds=5)

Optimize hyperparameters

Parameters:

X (np.ndarray) – training features
y (np.ndarray) – training labels
param_space (dict) – hyperparameters to optimize
groups (np.ndarray) – group values (e.g year values) for each row in X and y
kfolds (int) – number of splits in cross validation

Returns:

A sklearn pipeline refitted with the optimal hyperparameters.
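The `groups` argument suggests cross-validation folds that keep all rows of one group (for example, one year) on the same side of each split. A minimal pure-Python sketch of that idea follows. The function name is hypothetical, and the source does not confirm which splitter the package uses. scikit-learn's `GroupKFold` is an established implementation of the same technique.

```python
def group_kfold_indices(groups, kfolds=5):
    """Yield (train_idx, test_idx) so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    if kfolds < 2 or kfolds > len(unique_groups):
        raise ValueError("kfolds must be between 2 and the number of groups")
    for fold in range(kfolds):
        # Every kfolds-th group (starting at `fold`) is held out in this fold.
        held_out = set(unique_groups[fold::kfolds])
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx


# One group label (here a year) per row of X and y.
years = [2001, 2001, 2002, 2003, 2003, 2004]
folds = list(group_kfold_indices(years, kfolds=2))
```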

_predict(crop: str, data_items: Iterable)

Utility method called by both predict_items and predict.

Parameters:

crop (str) – crop name
data_items (Iterable) – a Dataset or a list of data items

Returns:

A tuple containing a np.ndarray and a dict with additional information.

fit(dataset: Dataset, optimize_hyperparameters=False, select_features=False, **fit_params) → tuple

Fit or train the model.

Parameters:

dataset (Dataset) – training dataset
optimize_hyperparameters (bool) – flag to optimize hyperparameters
select_features (bool) – flag to select features
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

load(model_name: str)

Deserialize a saved model.

Parameters:

model_name (str) – Filename that was used to save the model.

Returns:

The deserialized model.

predict(dataset: Dataset, **predict_params)

Run fitted model on batched data items.

Parameters:

dataset (Dataset) – test dataset
**predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

predict_items(X: list, crop=None, **predict_params)

Run fitted model on a list of data items.

Parameters:

X (list) – a list of data items, each of which is a dict
crop (str) – crop name
**predict_params – Additional parameters.

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name: str)

Save model, e.g. using pickle. Check here for options to save and load scikit-learn models: https://scikit-learn.org/stable/model_persistence.html

Parameters:

model_name (str) – Filename that will be used to save the model.

class cybench.models.sklearn_models.SklearnRandomForest(feature_cols: list = None)

Bases: BaseSklearnModel

_abc_impl = <_abc._abc_data object>

fit(train_dataset: Dataset, **fit_params)

Fit or train the model.

Parameters:

train_dataset (Dataset) – training dataset
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

class cybench.models.sklearn_models.SklearnRidge(feature_cols: list = None)

Bases: BaseSklearnModel

_abc_impl = <_abc._abc_data object>

fit(train_dataset: Dataset, **fit_params)

Fit or train the model.

Parameters:

train_dataset (Dataset) – training dataset
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

cybench.models.trend_models module

class cybench.models.trend_models.TrendModel

Bases: BaseModel

Default trend estimator.

Trend is estimated using years as features.

MAX_TREND_WINDOW_SIZE = 10

MIN_TREND_WINDOW_SIZE = 5

_abc_impl = <_abc._abc_data object>

_estimate_trend(trend_x: list, trend_y: list, test_x: int)

Implements a linear trend. From @mmeronijrc: Small sample sizes and the use of quadratic or loess trend can lead to strange results.

Parameters:

trend_x (list) – years in the trend window
trend_y (list) – values (e.g. yields) in the trend window
test_x (int) – test year

Returns:

estimated trend (float)
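A linear trend of this kind can be sketched as an ordinary least-squares line through the (year, value) pairs in the window, evaluated at the test year. The function name below is hypothetical and the code is an illustration of the technique, not the package's implementation.

```python
def estimate_linear_trend(trend_x, trend_y, test_x):
    """Fit y = slope * x + intercept by least squares and evaluate at test_x."""
    n = len(trend_x)
    if n < 2 or n != len(trend_y):
        raise ValueError("need at least two (year, value) pairs of equal length")
    mean_x = sum(trend_x) / n
    mean_y = sum(trend_y) / n
    sxx = sum((x - mean_x) ** 2 for x in trend_x)
    if sxx == 0:
        raise ValueError("all years are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(trend_x, trend_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * test_x + intercept


years = [2015, 2016, 2017, 2018, 2019]
yields = [3.0, 3.5, 4.0, 4.5, 5.0]
# The values rise by 0.5 per year, so the 2020 estimate continues that line.
trend_2020 = estimate_linear_trend(years, yields, 2020)
```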

_find_optimal_trend_window(train_labels: ndarray, window_years: list, extend_forward: bool = False)

Find the optimal trend window based on the Mann-Kendall statistical test (via pymannkendall).

Parameters:

train_labels (np.ndarray) – years and values for a specific location.
window_years (list) – years to consider in a window
extend_forward (bool) – if true, extend trend window forward, else backward

Returns:

a list of years representing the optimal trend window
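For readers unfamiliar with the test behind pymannkendall, the sketch below computes the basic Mann-Kendall S statistic and its normal approximation in plain Python. It assumes no tied values (so no tie correction is applied), the function names and the `z_crit` threshold are hypothetical, and it does not reproduce how the package grows or selects the window.

```python
import math


def mann_kendall_z(values):
    """Return (S, z) for the Mann-Kendall trend test, assuming no tied values."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    # S counts later-minus-earlier pairs that increase, minus those that decrease.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18  # variance without tie correction
    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z


def has_trend(values, z_crit=1.96):
    """Two-sided check at roughly the 5% level (z_crit is an assumed threshold)."""
    _, z = mann_kendall_z(values)
    return abs(z) > z_crit


rising = [1, 2, 3, 4, 5, 6]
noisy = [2, 1, 3, 2.5, 1.5, 2.2]
s_rising, z_rising = mann_kendall_z(rising)
```

A window-selection routine could call a check like `has_trend` on candidate windows between MIN_TREND_WINDOW_SIZE and MAX_TREND_WINDOW_SIZE years. That usage is an assumption, not documented behaviour.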

_predict_trend(test_data: Iterable)

Predict the trend for each data item in test data.

Parameters:

test_data (Iterable) – Dataset or a list of data items

Returns:

np.ndarray of predictions

fit(dataset: Dataset, **fit_params) → tuple

Fit or train the model.

Parameters:

dataset – Dataset
**fit_params – Additional parameters.

Returns:

A tuple containing the fitted model and a dict with additional information.

load(model_name)

Deserialize a saved model.

Parameters:

model_name – Filename that was used to save the model.

Returns:

The deserialized model.

predict(dataset: Dataset, **predict_params)

Run fitted model on a test dataset.

Parameters:

dataset – Dataset
**predict_params – Additional parameters

Returns:

A tuple containing a np.ndarray and a dict with additional information.

predict_items(X: list, **predict_params)

Run fitted model on a list of data items.

Parameters:

X – a list of data items, each of which is a dict
**predict_params – Additional parameters

Returns:

A tuple containing a np.ndarray and a dict with additional information.

save(model_name): Save model, e.g. using pickle. :param model_name: Filename that will be used to save the model.
