Key Concepts of Ray Train
There are three main concepts in the Ray Train library.
Trainers execute distributed training.
Configuration objects are used to configure training.
Checkpoints are returned as the result of training.
Trainers
Trainers are responsible for executing (distributed) training runs. The output of a Trainer run is a Result that contains metrics from the training run and the latest saved Checkpoint. Trainers can also be configured with Datasets and Preprocessors for scalable data ingest and preprocessing.
Deep Learning, Tree-Based, and other Trainers
There are three categories of built-in Trainers: deep learning trainers, tree-based trainers, and other trainers.
Ray Train supports the following deep learning trainers:
For these trainers, you usually define your own training function that loads the model and executes single-worker training steps. Refer to the following guides for more details:
Tree-based trainers use gradient-boosted decision trees for training. The most popular libraries for this are XGBoost and LightGBM.
For these trainers, you just pass a dataset and parameters. The training loop is configured automatically.
Some trainers don’t fit into the other two categories, such as:
TransformersTrainer for NLP
RLTrainer for reinforcement learning
SklearnTrainer for (non-distributed) training of sklearn models
Train Configuration
Trainers are configured with configuration objects. There are two main configuration classes:
ScalingConfig and RunConfig.
The latter contains subconfigurations, such as FailureConfig,
SyncConfig, and CheckpointConfig.
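The nesting described above can be sketched as follows. This is a minimal configuration fragment, assuming a Ray 2.x installation in which these classes are importable from `ray.air.config` (newer releases also export them from `ray.train`). The run name and all numeric values are illustrative assumptions, not recommended settings.

```python
# Assumes Ray 2.x is installed; the import path may differ between releases.
from ray.air.config import (
    CheckpointConfig,
    FailureConfig,
    RunConfig,
    ScalingConfig,
)

# ScalingConfig: how many training workers to launch and whether they use GPUs.
scaling_config = ScalingConfig(num_workers=2, use_gpu=False)

# RunConfig: run-level settings, with subconfigurations nested inside it.
run_config = RunConfig(
    name="example_run",  # hypothetical run name
    failure_config=FailureConfig(max_failures=2),  # allow up to 2 recovery attempts
    checkpoint_config=CheckpointConfig(num_to_keep=1),  # retain a single checkpoint
)
```

Both objects would then be handed to a Trainer through its `scaling_config` and `run_config` arguments.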
Train Checkpoints
Calling Trainer.fit() returns a Result object, which includes
information about the run such as the reported metrics and the saved checkpoints.
Checkpoints have the following purposes:
They can be passed to a Trainer to resume training from the given model state.
They can be used to create a Predictor / BatchPredictor for scalable batch prediction.
They can be deployed with Ray Serve.
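The first purpose, resuming training from a saved model state, can be illustrated without Ray at all. The following is a toy, single-process sketch: `ToyTrainer`, `ToyResult`, and `ToyCheckpoint` are hypothetical stand-ins invented for this example and are not Ray Train classes. It only shows the shape of the idea, namely that a fit call returns metrics plus the latest checkpoint, and that passing the checkpoint to a new trainer continues from where the previous run stopped.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyCheckpoint:
    """Hypothetical stand-in for a checkpoint: a snapshot of training state."""
    epoch: int
    weight: float


@dataclass
class ToyResult:
    """Hypothetical stand-in for a result: metrics plus the latest checkpoint."""
    metrics: Dict[str, float]
    checkpoint: ToyCheckpoint


class ToyTrainer:
    """Fits y = weight * x on a single data point with gradient descent."""

    def __init__(self, num_epochs: int, resume_from: Optional[ToyCheckpoint] = None):
        self.num_epochs = num_epochs
        self.resume_from = resume_from

    def fit(self) -> ToyResult:
        # Start from the checkpointed state if one was given, else from scratch.
        start_epoch = self.resume_from.epoch if self.resume_from else 0
        weight = self.resume_from.weight if self.resume_from else 0.0

        x, y, lr = 2.0, 6.0, 0.1  # the ideal weight is 3.0
        loss = (weight * x - y) ** 2
        checkpoint = ToyCheckpoint(start_epoch, weight)

        for epoch in range(start_epoch, self.num_epochs):
            grad = 2 * (weight * x - y) * x  # derivative of the squared error
            weight -= lr * grad
            loss = (weight * x - y) ** 2
            # Record the latest state after every epoch.
            checkpoint = ToyCheckpoint(epoch + 1, weight)

        return ToyResult(metrics={"loss": loss}, checkpoint=checkpoint)


# Train for 3 epochs, then resume from the returned checkpoint up to epoch 6.
first = ToyTrainer(num_epochs=3).fit()
resumed = ToyTrainer(num_epochs=6, resume_from=first.checkpoint).fit()

# Training 6 epochs in one go ends in the same state as 3 + 3 with a resume.
straight = ToyTrainer(num_epochs=6).fit()
print(resumed.checkpoint == straight.checkpoint)
```

In this sketch the checkpoint carries everything needed to continue (the epoch counter and the model weight), which is why the resumed run and the uninterrupted run end in the same state.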