cybench.models package
Submodules
cybench.models.model module
Model base class
- class cybench.models.model.BaseModel
Bases:
ABC
- _abc_impl = <_abc._abc_data object>
- abstract fit(dataset: Dataset, **fit_params) tuple
Fit or train the model.
- Parameters:
dataset – Dataset
**fit_params – Additional parameters.
- Returns:
A tuple containing the fitted model and a dict with additional information.
- abstract load(model_name)
Deserialize a saved model.
- Parameters:
model_name – Filename that was used to save the model.
- Returns:
The deserialized model.
- predict(dataset: Dataset, **predict_params) tuple
Run fitted model on data.
- Parameters:
dataset – Dataset
**predict_params – Additional parameters.
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- abstract predict_items(X: list, **predict_params)
Run fitted model on a list of data items.
- Parameters:
X – a list of data items, each of which is a dict
**predict_params – Additional parameters.
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- abstract save(model_name)
Save model, e.g. using pickle.
- Parameters:
model_name – Filename that will be used to save the model.
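As a rough illustration of the interface above, the sketch below defines a stand-in abstract base class that mirrors the documented method names, plus a toy subclass. `SketchBaseModel`, `ConstantModel` and the `"yield"` key are hypothetical names, not part of cybench. Plain lists stand in for `Dataset` and `np.ndarray` so the example has no dependencies, and making `load` a classmethod is an assumption borrowed from `BaseNNModel`.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod


class SketchBaseModel(ABC):
    """Stand-in mirroring the documented BaseModel interface (not the real class)."""

    @abstractmethod
    def fit(self, dataset, **fit_params) -> tuple:
        ...

    @abstractmethod
    def predict_items(self, X: list, **predict_params) -> tuple:
        ...

    def predict(self, dataset, **predict_params) -> tuple:
        # Assumption: predicting on a dataset can delegate to predict_items.
        return self.predict_items(list(dataset), **predict_params)

    @abstractmethod
    def save(self, model_name):
        ...

    @classmethod
    @abstractmethod
    def load(cls, model_name):
        ...


class ConstantModel(SketchBaseModel):
    """Toy model that always predicts the mean training label."""

    def __init__(self):
        self.mean_ = None

    def fit(self, dataset, **fit_params):
        labels = [item["yield"] for item in dataset]  # "yield" key is hypothetical
        self.mean_ = sum(labels) / len(labels)
        # Return the fitted model and a dict with additional information.
        return self, {}

    def predict_items(self, X, **predict_params):
        return [self.mean_ for _ in X], {}

    def save(self, model_name):
        # Only the instance state is pickled, which keeps the file small.
        with open(model_name, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, model_name):
        model = cls()
        with open(model_name, "rb") as f:
            model.__dict__.update(pickle.load(f))
        return model


train = [{"yield": 2.0}, {"yield": 4.0}]
model, info = ConstantModel().fit(train)
preds, _ = model.predict(train)

path = os.path.join(tempfile.mkdtemp(), "constant.pkl")
model.save(path)
restored = ConstantModel.load(path)
```

Every method that produces predictions returns a pair, so callers unpack the predictions and the information dict together.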
cybench.models.naive_models module
- class cybench.models.naive_models.AverageYieldModel(group_by=['adm_id'])
Bases:
BaseModel
A naive yield prediction model.
Predicts the average of the training set by location. If the location is not in the training data, then predicts the global average.
- _abc_impl = <_abc._abc_data object>
- fit(dataset: Dataset, **fit_params) tuple
Fit or train the model.
- Parameters:
dataset – Dataset
**fit_params – Additional parameters.
- Returns:
A tuple containing the fitted model and a dict with additional information.
- load(model_name)
Deserialize a saved model.
- Parameters:
model_name – Filename that was used to save the model.
- Returns:
The deserialized model.
- predict_items(X: list)
Run fitted model on a list of data items.
- Parameters:
X – a list of data items, each of which is a dict
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- save(model_name)
Save model, e.g. using pickle.
- Parameters:
model_name – Filename that will be used to save the model.
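The averaging rule described for `AverageYieldModel` can be sketched in a few lines. This is not the library's implementation: the function names and the `"yield"` label key are hypothetical, `"adm_id"` is taken from the default `group_by` value, and the global average is assumed to be the mean over all training items.

```python
from collections import defaultdict


def fit_location_averages(items, group_key="adm_id", label_key="yield"):
    """Return (per-group averages, global average) from training items."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for item in items:
        sums[item[group_key]] += item[label_key]
        counts[item[group_key]] += 1
    group_avg = {key: sums[key] / counts[key] for key in sums}
    # Assumption: the global average is taken over all items, not over groups.
    global_avg = sum(sums.values()) / sum(counts.values())
    return group_avg, global_avg


def predict_location_averages(items, group_avg, global_avg, group_key="adm_id"):
    """Predict the group average, falling back to the global average."""
    return [group_avg.get(item[group_key], global_avg) for item in items]


train = [
    {"adm_id": "A", "yield": 2.0},
    {"adm_id": "A", "yield": 4.0},
    {"adm_id": "B", "yield": 6.0},
]
group_avg, global_avg = fit_location_averages(train)
preds = predict_location_averages(
    [{"adm_id": "A"}, {"adm_id": "B"}, {"adm_id": "C"}], group_avg, global_avg
)
```

Location `"C"` does not appear in the training items, so it receives the global average.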
cybench.models.nn_models module
- class cybench.models.nn_models.BaseNNModel(**kwargs)
Bases:
BaseModel, Module
- _abc_impl = <_abc._abc_data object>
- _forward_pass(batch: dict, device: str)
A forward pass for batched data.
- Parameters:
batch (dict) – batched inputs
device (str) – the device to use
- Returns:
An np.ndarray
- _normalize_inputs(inputs)
Normalize inputs using saved normalization parameters.
- Parameters:
inputs (dict) – unnormalized inputs
- Returns:
The same dict after normalizing the entries
- _optimize_hyperparameters(dataset: Dataset, param_space: dict, optim_kwargs: dict, device: str = 'cpu', kfolds: int = 1, epochs: int = 10, **fit_params) dict
Optimize hyperparameters
- Parameters:
dataset (Dataset) – training dataset
param_space (dict) – hyperparameters to optimize
optim_kwargs (dict) – arguments to the optimizer
device (str) – the device to use
kfolds (int) – k for k-fold cv (default: 1)
epochs (int) – Number of epochs to train (default: 10)
**fit_params – Additional parameters.
- Returns:
A dict of optimal hyperparameter settings.
- _train_and_validate(dataset: Dataset, train_years: list, val_years: list, validation_interval: int = 5, epochs: int = 10, batch_size: int = 16, optimizer_fn: callable = <class 'torch.optim.adam.Adam'>, optim_kwargs: dict = {}, loss_fn: callable = <function mse_loss>, loss_kwargs: dict = {}, scheduler_fn: callable | None = None, sched_kwargs: dict = {}, device: str = 'cpu', **kwargs)
Fit or train the model and evaluate on validation data.
- Parameters:
dataset (Dataset) – training dataset
train_years (list) – training years
val_years (list) – validation years
validation_interval (int) – validation frequency (default: 5)
epochs (int) – the number of epochs to train the model (default: 10)
batch_size (int) – the batch size (default: 16)
optimizer_fn (callable) – the optimizer function (default: Adam)
optim_kwargs (dict) – arguments to the optimizer function
loss_fn (callable) – the loss function (default: mse_loss)
loss_kwargs (dict) – arguments to the loss function
scheduler_fn (callable) – the scheduler function (default: None)
sched_kwargs (dict) – arguments to the scheduler function
device (str) – the device to use
**kwargs – Additional parameters.
- Returns:
A tuple of training losses, validation losses and the maximum number of epochs to train.
- _train_epoch(tqdm_loader: tqdm, device: str, optimizer: Optimizer, loss_fn: callable = <function mse_loss>, loss_kwargs: dict = {'reduction': 'mean'}, scheduler: LRScheduler | None = None)
Run one epoch during training
- Parameters:
tqdm_loader (tqdm) – data loader with progress bar
device (str) – the device to use
optimizer (torch.optim.Optimizer) – the optimizer
loss_fn (callable) – the loss function, default mse_loss
loss_kwargs (dict) – the arguments to loss_fn
scheduler (torch.optim.lr_scheduler.LRScheduler) – scheduler for learning rate of optimizer
- Returns:
The average of all batch losses
- _train_final_model(dataset: Dataset, epochs: int, optimizer_fn: callable = <class 'torch.optim.adam.Adam'>, optim_kwargs: dict = {}, loss_fn: callable = <function mse_loss>, loss_kwargs: dict = {'reduction': 'mean'}, scheduler_fn: callable | None = None, sched_kwargs: dict = {}, device: str = 'cpu', batch_size: int = 16, **kwargs)
Fit or train the model on the entire training set.
- Parameters:
dataset (Dataset) – training dataset
epochs (int) – number of epochs to train
optimizer_fn (callable) – the optimizer function (default: Adam)
optim_kwargs (dict) – arguments to the optimizer function
loss_fn (callable) – the loss function (default: mse_loss)
loss_kwargs (dict) – arguments to the loss function
scheduler_fn (callable) – the scheduler function (default: None)
sched_kwargs (dict) – arguments to the scheduler function
device (str) – the device to use
batch_size (int) – default is 16
**kwargs – Additional parameters.
- Returns:
A list of training losses (one value per epoch).
- fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict | None = None, optim_kwargs: dict = {}, device: str = 'cpu', seed: int = 42, **fit_params)
Fit or train the model.
- Parameters:
dataset (Dataset) – training dataset.
optimize_hyperparameters (bool) – whether to tune hyperparameters
param_space (dict) – each entry is a hyperparameter name and list or range of values
optim_kwargs (dict) – arguments to the optimizer
device (str) – the device to use.
seed (int) – seed for random number generator
**fit_params – Additional parameters.
- Returns:
A tuple containing the fitted model and a dict with additional information.
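To make the shape of `param_space` concrete, the sketch below enumerates every combination of values from such a dict. Exhaustive enumeration is an assumption made for illustration. The search strategy actually used is not stated here, and `iter_param_settings` and the hyperparameter names are hypothetical.

```python
from itertools import product


def iter_param_settings(param_space):
    """Yield one dict per combination of hyperparameter values."""
    names = sorted(param_space)
    # Each entry may be a list or a range; product accepts any iterable.
    for values in product(*(param_space[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical search space; the hyperparameter names are illustrative only.
param_space = {"lr": [1e-3, 1e-2], "weight_decay": [0.0, 1e-4]}
settings = list(iter_param_settings(param_space))
range_settings = list(iter_param_settings({"num_layers": range(1, 3)}))
```

Two hyperparameters with two values each give four candidate settings, which is why the size of each list or range matters for the cost of tuning.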
- classmethod load(model_name)
Load model using torch.load.
- Parameters:
model_name (str) – Filename that was used to save the model.
- Returns:
The loaded model.
- predict(dataset: Dataset, device: str = 'cpu', batch_size: int = 16, **predict_params)
Run fitted model on batched data items.
- Parameters:
dataset (Dataset) – validation dataset
device (str) – the device to use
batch_size (int) – the batch size (default: 16)
**predict_params – Additional parameters
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- predict_items(X: list, device: str = 'cpu', **predict_params)
Run fitted model on a list of data items.
- Parameters:
X (list) – a list of data items, each of which is a dict
device (str) – the device to use
**predict_params – Additional parameters
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- save(model_name)
Save model using torch.save.
- Parameters:
model_name (str) – Filename that will be used to save the model.
- class cybench.models.nn_models.BaselineInceptionTime(time_series_have_same_length=False, hidden_size=64, num_layers=6, num_features=32, output_size=1, **kwargs)
Bases:
BaseNNModel
InceptionTime model.
- Parameters:
time_series_have_same_length (bool) – whether time series have the same length
hidden_size (int) – The number of features InceptionTime outputs
num_layers (int) – The number of InceptionBlocks. Defaults to 6.
num_features (int) – The number of features within the InceptionBlocks. Defaults to 32.
output_size (int) – The number of output classes. Defaults to 1.
**kwargs – Additional keyword arguments passed to the base class.
- _abc_impl = <_abc._abc_data object>
- fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = {}, kfolds: int = 1, epochs: int = 10, device: str = 'cpu', seed: int = 42, **fit_params)
Fit or train the model.
- Parameters:
dataset (Dataset) – Training dataset.
optimize_hyperparameters (bool) – Flag to tune hyperparameters.
param_space (dict) – Each entry is a hyperparameter name and list or range of values.
kfolds (int) – k in k-fold cv.
epochs (int) – Number of epochs to train.
device (str) – The device to use.
seed (int) – Seed for random number generator.
**fit_params – Additional parameters.
- Returns:
A tuple containing the fitted model and a dict with additional information.
- forward(x)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class cybench.models.nn_models.BaselineLSTM(time_series_have_same_length=False, hidden_size=64, num_layers=1, output_size=1, **kwargs)
Bases:
BaseNNModel
LSTM model.
- Parameters:
time_series_have_same_length (bool) – whether time series have the same length
hidden_size (int) – The number of features in the LSTM hidden state. Defaults to 64.
num_layers (int) – The number of LSTM layers. Defaults to 1.
output_size (int) – The number of output classes. Defaults to 1.
**kwargs – Additional keyword arguments passed to the base class.
- _abc_impl = <_abc._abc_data object>
- fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = {}, kfolds: int = 1, epochs: int = 10, device: str = 'cpu', seed: int = 42, **fit_params)
Fit or train the model.
- Parameters:
dataset (Dataset) – Training dataset.
optimize_hyperparameters (bool) – Flag to tune hyperparameters.
param_space (dict) – Each entry is a hyperparameter name and list or range of values.
kfolds (int) – k in k-fold cv.
epochs (int) – Number of epochs to train.
device (str) – The device to use.
seed (int) – Seed for random number generator.
**fit_params – Additional parameters.
- Returns:
A tuple containing the fitted model and a dict with additional information.
- forward(x)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class cybench.models.nn_models.BaselineTransformer(seq_len, time_series_have_same_length=False, hidden_size=64, d_model=64, n_head=1, d_ff=256, output_size=1, num_layers=3, **kwargs)
Bases:
BaseNNModel
Transformer model.
- Parameters:
seq_len (int) – length of time series sequence (in days)
time_series_have_same_length (bool) – whether time series have the same length
hidden_size (int) – The number of resulting timeseries features.
d_model (int) – Total dimension of the model.
n_head (int) – The number of parallel attention heads.
d_ff (int) – The dimension of the feedforward network model.
output_size (int) – The number of output classes. Defaults to 1.
num_layers (int) – The number of sub-encoder-layers in the encoder.
**kwargs – Additional keyword arguments passed to the base class.
- _abc_impl = <_abc._abc_data object>
- fit(dataset: Dataset, optimize_hyperparameters: bool = False, param_space: dict = {}, kfolds: int = 1, epochs: int = 10, device: str = 'cpu', seed: int = 42, **fit_params)
Fit or train the model.
- Parameters:
dataset (Dataset) – Training dataset.
optimize_hyperparameters (bool) – Flag to tune hyperparameters.
param_space (dict) – Each entry is a hyperparameter name and list or range of values.
kfolds (int) – k in k-fold cv.
epochs (int) – Number of epochs to train.
device (str) – The device to use.
seed (int) – Seed for random number generator.
**fit_params – Additional parameters.
- Returns:
A tuple containing the fitted model and a dict with additional information.
- forward(x)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- cybench.models.nn_models.separate_ts_static_inputs(batch: dict) tuple
Stack time series and static inputs separately.
- Parameters:
batch (dict) – batched inputs
- Returns:
A tuple of torch tensors for time series and static inputs
cybench.models.residual_models module
- class cybench.models.residual_models.InceptionTimeRes(**kwargs)
Bases:
ResidualModel
- _abc_impl = <_abc._abc_data object>
- class cybench.models.residual_models.LSTMRes(**kwargs)
Bases:
ResidualModel
- _abc_impl = <_abc._abc_data object>
- class cybench.models.residual_models.RandomForestRes(feature_cols: list | None = None)
Bases:
ResidualModel
- _abc_impl = <_abc._abc_data object>
- class cybench.models.residual_models.ResidualModel(baseline_model: BaseModel)
Bases:
BaseModel
- _abc_impl = <_abc._abc_data object>
- fit(dataset: Dataset, **fit_params)
Fit or train the model.
- Parameters:
dataset (Dataset) – training dataset
**fit_params – Additional parameters.
- Returns:
A tuple containing the fitted model and a dict with additional information.
- load(model_name: str)
Deserialize a saved model.
- Parameters:
model_name (str) – Filename that was used to save the model.
- Returns:
The deserialized model.
- predict(dataset: Dataset, **predict_params)
Run fitted model on batched data items.
- Parameters:
dataset (Dataset) – test dataset
**predict_params – Additional parameters.
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- predict_items(X: list, crop=None, **predict_params)
Run fitted model on a list of data items.
- Parameters:
X (list) – a list of data items, each of which is a dict
crop (str) – crop name
**predict_params – Additional parameters.
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- save(model_name: str)
Save model, e.g. using pickle. Check here for options to save and load scikit-learn models: https://scikit-learn.org/stable/model_persistence.html
- Parameters:
model_name (str) – Filename that will be used to save the model.
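The reference above does not spell out how `ResidualModel` combines its pieces, so the following is only a generic sketch of residual modelling, not the library's code. It assumes a first estimator is fitted on the targets and the wrapped model is fitted on what remains, with the two predictions summed. `MeanEstimator`, `SlopeEstimator` and `ResidualSketch` are hypothetical names.

```python
class MeanEstimator:
    """Toy estimator: predicts the mean of the targets it was fitted on."""

    def fit(self, xs, ys):
        self.mean_ = sum(ys) / len(ys)
        return self

    def predict(self, xs):
        return [self.mean_ for _ in xs]


class SlopeEstimator:
    """Toy estimator: least-squares line through the origin, y = w * x."""

    def fit(self, xs, ys):
        self.w_ = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return self

    def predict(self, xs):
        return [self.w_ * x for x in xs]


class ResidualSketch:
    """Fit `first`, then fit `wrapped` on the residuals; predictions are summed."""

    def __init__(self, first, wrapped):
        self.first = first
        self.wrapped = wrapped

    def fit(self, xs, ys):
        self.first.fit(xs, ys)
        # Residuals are what the first estimator fails to explain.
        residuals = [y - p for y, p in zip(ys, self.first.predict(xs))]
        self.wrapped.fit(xs, residuals)
        return self

    def predict(self, xs):
        first_pred = self.first.predict(xs)
        resid_pred = self.wrapped.predict(xs)
        return [a + b for a, b in zip(first_pred, resid_pred)]


xs = [-1.0, 0.0, 1.0]
ys = [3.0, 5.0, 7.0]  # generated from y = 5 + 2x
model = ResidualSketch(MeanEstimator(), SlopeEstimator()).fit(xs, ys)
preds = model.predict([2.0])
```

In this toy case the first estimator captures the level (5) and the wrapped one captures the slope (2), so the summed prediction recovers the generating line.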
- class cybench.models.residual_models.RidgeRes(feature_cols: list | None = None)
Bases:
ResidualModel
- _abc_impl = <_abc._abc_data object>
- class cybench.models.residual_models.TransformerRes(**kwargs)
Bases:
ResidualModel
- _abc_impl = <_abc._abc_data object>
cybench.models.sklearn_models module
- class cybench.models.sklearn_models.BaseSklearnModel(**kwargs)
Bases:
BaseModel
Base class for wrappers around scikit-learn estimators.
- _abc_impl = <_abc._abc_data object>
- _design_features(crop: str, data_items: Iterable)
Design features using data samples.
- Parameters:
crop (str) – crop name
data_items (Iterable) – a Dataset or list of data items.
- Returns:
A pandas dataframe with KEY_LOC, KEY_YEAR and features.
- _optimize_hyperparameters(X: ndarray, y: ndarray, param_space: dict, groups: ndarray | None = None, kfolds=5)
Optimize hyperparameters
- Parameters:
X (np.ndarray) – training features
y (np.ndarray) – training labels
param_space (dict) – hyperparameters to optimize
groups (np.ndarray) – group values (e.g. year values) for each row in X and y
kfolds (int) – number of splits in cross validation
- Returns:
A sklearn pipeline refitted with the optimal hyperparameters.
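The role of `groups` in cross validation can be shown with a small dependency-free sketch: rows that share a group value (for example a year) are kept in the same fold. The round-robin assignment and the name `group_kfold_indices` are illustrative assumptions. Library splitters such as scikit-learn's `GroupKFold` assign groups to folds differently.

```python
def group_kfold_indices(groups, kfolds):
    """Split row indices into folds so that no group spans two folds."""
    unique_groups = sorted(set(groups))
    folds = []
    for fold in range(kfolds):
        # Every kfolds-th group, starting at `fold`, goes to validation.
        val_groups = set(unique_groups[fold::kfolds])
        val_idx = [i for i, g in enumerate(groups) if g in val_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
        folds.append((train_idx, val_idx))
    return folds


years = [2001, 2001, 2002, 2002, 2003, 2004]
folds = group_kfold_indices(years, kfolds=2)
```

Keeping all rows of a year together means a model is never validated on a year it has partially seen during training.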
- _predict(crop: str, data_items: Iterable)
Utility method called by both predict_items and predict.
- Parameters:
crop (str) – crop name
data_items (Iterable) – a Dataset or a list of data items
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- fit(dataset: Dataset, optimize_hyperparameters=False, select_features=False, **fit_params) tuple
Fit or train the model.
- Parameters:
dataset (Dataset) – training dataset
optimize_hyperparameters (bool) – flag to optimize hyperparameters
select_features (bool) – flag to select features
**fit_params – Additional parameters.
- Returns:
A tuple containing the fitted model and a dict with additional information.
- load(model_name: str)
Deserialize a saved model.
- Parameters:
model_name (str) – Filename that was used to save the model.
- Returns:
The deserialized model.
- predict(dataset: Dataset, **predict_params)
Run fitted model on batched data items.
- Parameters:
dataset (Dataset) – test dataset
**predict_params – Additional parameters.
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- predict_items(X: list, crop=None, **predict_params)
Run fitted model on a list of data items.
- Parameters:
X (list) – a list of data items, each of which is a dict
crop (str) – crop name
**predict_params – Additional parameters.
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- save(model_name: str)
Save model, e.g. using pickle. Check here for options to save and load scikit-learn models: https://scikit-learn.org/stable/model_persistence.html
- Parameters:
model_name (str) – Filename that will be used to save the model.
- class cybench.models.sklearn_models.SklearnRandomForest(feature_cols: list | None = None)
Bases:
BaseSklearnModel
- _abc_impl = <_abc._abc_data object>
- class cybench.models.sklearn_models.SklearnRidge(feature_cols: list | None = None)
Bases:
BaseSklearnModel
- _abc_impl = <_abc._abc_data object>
cybench.models.trend_models module
- class cybench.models.trend_models.TrendModel
Bases:
BaseModel
Default trend estimator.
Trend is estimated using years as features.
- MAX_TREND_WINDOW_SIZE = 10
- MIN_TREND_WINDOW_SIZE = 5
- _abc_impl = <_abc._abc_data object>
- _estimate_trend(trend_x: list, trend_y: list, test_x: int)
Implements a linear trend. From @mmeronijrc: Small sample sizes and the use of quadratic or loess trend can lead to strange results.
- Parameters:
trend_x (list) – years in the trend window
trend_y (list) – values (e.g. yields) in the trend window
test_x (int) – test year
- Returns:
estimated trend (float)
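A linear trend of this kind can be sketched with ordinary least squares on the years. This is a minimal illustration under that assumption, not the library's code; `estimate_linear_trend` is a hypothetical name, and its arguments simply mirror the documented parameters.

```python
def estimate_linear_trend(trend_x, trend_y, test_x):
    """Fit a straight line by ordinary least squares and evaluate it at test_x."""
    n = len(trend_x)
    mean_x = sum(trend_x) / n
    mean_y = sum(trend_y) / n
    sxx = sum((x - mean_x) ** 2 for x in trend_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(trend_x, trend_y))
    slope = sxy / sxx
    # Evaluating around the mean year avoids a large intercept term.
    return mean_y + slope * (test_x - mean_x)


years = [2015, 2016, 2017, 2018, 2019]
yields = [4.0, 4.5, 5.0, 5.5, 6.0]
trend_2020 = estimate_linear_trend(years, yields, 2020)
```

With only five to ten points in a window, a straight line has two parameters to estimate, which is part of why higher-order fits become unstable at these sample sizes.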
- _find_optimal_trend_window(train_labels: ndarray, window_years: list, extend_forward: bool = False)
Find the optimal trend window based on pymannkendall statistical test.
- Parameters:
train_labels (np.ndarray) – years and values for a specific location.
window_years (list) – years to consider in a window
extend_forward (bool) – if true, extend trend window forward, else backward
- Returns:
a list of years representing the optimal trend window
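For readers unfamiliar with the Mann-Kendall test, the sketch below computes its classic S statistic and normal approximation by hand. It does not call pymannkendall, assumes no tied values, and does not reproduce the window search itself; `mann_kendall_z` is a hypothetical name.

```python
import math


def mann_kendall_z(values):
    """Return (S, Z) for the Mann-Kendall trend test, assuming no tied values."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            # Add +1 for an increase, -1 for a decrease, 0 for equality.
            s += (diff > 0) - (diff < 0)
    # Variance of S without the correction for ties.
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z


s_up, z_up = mann_kendall_z([1.0, 2.0, 3.0, 4.0, 5.0])
s_mixed, z_mixed = mann_kendall_z([3.0, 1.0, 4.0, 2.0, 5.0])
```

A window whose |Z| exceeds a chosen critical value (1.96 for a two-sided 5% test) shows a significant monotonic trend. How candidate windows between `MIN_TREND_WINDOW_SIZE` and `MAX_TREND_WINDOW_SIZE` are compared is not specified in this reference.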
- _predict_trend(test_data: Iterable)
Predict the trend for each data item in test data.
- Parameters:
test_data (Iterable) – Dataset or a list of data items
- Returns:
np.ndarray of predictions
- fit(dataset: Dataset, **fit_params) tuple
Fit or train the model.
- Parameters:
dataset – Dataset
**fit_params – Additional parameters.
- Returns:
A tuple containing the fitted model and a dict with additional information.
- load(model_name)
Deserialize a saved model.
- Parameters:
model_name – Filename that was used to save the model.
- Returns:
The deserialized model.
- predict(dataset: Dataset, **predict_params)
Run fitted model on a test dataset.
- Parameters:
dataset – Dataset
**predict_params – Additional parameters
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- predict_items(X: list, **predict_params)
Run fitted model on a list of data items.
- Parameters:
X – a list of data items, each of which is a dict
**predict_params – Additional parameters
- Returns:
A tuple containing a np.ndarray and a dict with additional information.
- save(model_name)
Save model, e.g. using pickle.
- Parameters:
model_name – Filename that will be used to save the model.