ray.train.sklearn.SklearnTrainer
- class ray.train.sklearn.SklearnTrainer(*args, **kwargs)
Bases: ray.train.base_trainer.BaseTrainer
A Trainer for scikit-learn estimator training.
This Trainer runs the fit method of the given estimator in a non-distributed manner on a single Ray Actor.
By default, the n_jobs (or thread_count) estimator parameters will be set to match the number of CPUs assigned to the Ray Actor. This behavior can be disabled by setting set_estimator_cpus=False.
If you wish to use GPU-enabled estimators (e.g., cuML), make sure to set "GPU": 1 in scaling_config.trainer_resources.
The results are reported all at once and not in an iterative fashion. No checkpointing is done during training. This may be changed in the future.
Example:
import ray

from ray.train.sklearn import SklearnTrainer
from sklearn.ensemble import RandomForestRegressor

# Build a small in-memory dataset with feature "x" and label "y".
train_dataset = ray.data.from_items(
    [{"x": x, "y": x + 1} for x in range(32)])

# Training runs on a single Ray Actor with 4 CPUs.
trainer = SklearnTrainer(
    estimator=RandomForestRegressor(),
    label_column="y",
    scaling_config=ray.train.ScalingConfig(
        trainer_resources={"CPU": 4}
    ),
    datasets={"train": train_dataset}
)
result = trainer.fit()
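In the example above, trainer_resources assigns 4 CPUs, so by default the estimator's n_jobs-style parameters would be set to 4. The following library-free sketch illustrates that idea only and is not Ray's implementation. It assumes a flat parameter mapping of the kind scikit-learn's get_params(deep=True) returns, where nested parameters use double-underscore keys. The helper name cpu_param_updates and the sample mapping are hypothetical.

```python
from typing import Any, Dict

# Parameter names the class description lists as CPU-related.
CPU_PARAM_NAMES = ("n_jobs", "thread_count")


def cpu_param_updates(params: Dict[str, Any], num_cpus: int) -> Dict[str, int]:
    """Return updates that set every CPU-related parameter to num_cpus.

    `params` is assumed to look like the output of a scikit-learn
    estimator's get_params(deep=True), where nested parameters are
    addressed as "<component>__<name>".
    """
    updates = {}
    for key in params:
        # The last "__"-separated segment is the parameter's own name.
        name = key.rsplit("__", 1)[-1]
        if name in CPU_PARAM_NAMES:
            updates[key] = num_cpus
    return updates


# Hypothetical parameters of a pipeline that wraps a random forest.
example_params = {
    "memory": None,
    "model__n_estimators": 100,
    "model__n_jobs": None,
    "n_jobs": 1,
}
updates = cpu_param_updates(example_params, num_cpus=4)
print(updates)
```

With a real scikit-learn estimator, updates of this shape could be applied with estimator.set_params(**updates).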
- Parameters
estimator – A scikit-learn compatible estimator to use.
datasets – Datasets to use for training and validation. Must include a “train” key denoting the training dataset. If a preprocessor is provided and has not already been fit, it will be fit on the training dataset. All datasets will be transformed by the preprocessor if one is provided. All non-training datasets will be used as separate validation sets, each reporting separate metrics.
label_column – Name of the label column. A column with this name must be present in the training dataset. If None, no validation will be performed.
params – Optional dict of params to be set on the estimator before fitting. Useful for hyperparameter tuning.
scoring –
Strategy to evaluate the performance of the model on the validation sets and for cross-validation. Same as in sklearn.model_selection.cross_validate.
If scoring represents a single score, one can use:
a single string;
a callable that returns a single value.
If scoring represents multiple scores, one can use:
a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.
cv –
Determines the cross-validation splitting strategy. If specified, cross-validation will be run on the train dataset, in addition to computing metrics for validation datasets. Same as in sklearn.model_selection.cross_validate, with the exception of None. Possible inputs for cv are:
None, to skip cross-validation;
an int, to specify the number of folds in a (Stratified)KFold;
a CV splitter;
an iterable yielding (train, test) splits as arrays of indices.
For int inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.
If you provide a “cv_groups” column in the train dataset, it will be used as group labels for the samples when splitting the dataset into train/test sets. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold). This corresponds to the groups argument in sklearn.model_selection.cross_validate.
return_train_score_cv – Whether to also return train scores during cross-validation. Ignored if cv is None.
parallelize_cv – If set to True, will parallelize cross-validation instead of the estimator. If set to None, will detect if the estimator has any parallelism-related params (n_jobs or thread_count) and parallelize cross-validation if there are none. If False, will not parallelize cross-validation. Cannot be set to True if there are any GPUs assigned to the trainer. Ignored if cv is None.
set_estimator_cpus – If set to True, will automatically set the values of all n_jobs and thread_count parameters in the estimator (including in nested objects) to match the number of available CPUs.
scaling_config – Configuration for how to scale training. Only the trainer_resources key can be provided, as the training is not distributed.
run_config – Configuration for the execution of the training run.
preprocessor – A ray.data.Preprocessor to preprocess the provided datasets.
**fit_params – Additional kwargs passed to the estimator.fit() method.
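One of the accepted cv inputs is an iterable yielding (train, test) index splits. A minimal, library-free sketch of such an iterable follows. It mimics contiguous, unshuffled K-fold splitting, so the splits are the same on every call. The function name contiguous_kfold is hypothetical, and real code would more likely use a scikit-learn splitter or NumPy index arrays than plain lists.

```python
from typing import Iterator, List, Tuple


def contiguous_kfold(
    n_samples: int, n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for unshuffled K-fold splitting."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Materialize the splits once; a generator can only be consumed a single time.
splits = list(contiguous_kfold(n_samples=10, n_splits=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Because nothing is shuffled, calling the function again with the same arguments produces identical folds, which mirrors the shuffle=False behavior described for the built-in splitters.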
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
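The parallelize_cv rules listed under Parameters can be summarized as a small decision function. This is a sketch of the documented behavior, not Ray's code: the function name and its arguments are hypothetical, and treating parallelize_cv=None with GPUs assigned as "do not parallelize" is an assumption that the parameter description does not spell out.

```python
from typing import Iterable, Optional

# Parallelism-related parameter names named in the parameter description.
CPU_PARAM_NAMES = ("n_jobs", "thread_count")


def resolve_parallelize_cv(
    parallelize_cv: Optional[bool],
    cv_is_set: bool,
    estimator_param_names: Iterable[str],
    num_gpus: float = 0,
) -> bool:
    """Decide whether cross-validation itself should run in parallel."""
    if not cv_is_set:
        # The setting is ignored when no cross-validation is requested.
        return False
    if parallelize_cv and num_gpus > 0:
        raise ValueError("parallelize_cv cannot be True when GPUs are assigned")
    if parallelize_cv is None:
        # Auto-detect: parallelize only if the estimator cannot use the CPUs.
        has_cpu_params = any(
            name.rsplit("__", 1)[-1] in CPU_PARAM_NAMES
            for name in estimator_param_names
        )
        # Assumption: never auto-parallelize when GPUs are assigned.
        return not has_cpu_params and num_gpus == 0
    return bool(parallelize_cv)


# An estimator without n_jobs/thread_count: cross-validation is parallelized.
print(resolve_parallelize_cv(None, True, ["alpha", "fit_intercept"]))
# An estimator that exposes n_jobs: the estimator is parallelized instead.
print(resolve_parallelize_cv(None, True, ["n_estimators", "n_jobs"]))
```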
Methods
Converts self to a tune.Trainable class.
can_restore(path) – Checks whether a given directory contains a restorable Train experiment.
fit() – Runs training.
Called during fit() to preprocess dataset attributes with preprocessor.
restore(path[, datasets, preprocessor, ...]) – Restores a Train experiment from a previously interrupted/failed run.
setup() – Called during fit() to perform initial setup on the Trainer.