Configuration and Persistent Storage

Run Configuration in Train (RunConfig)

RunConfig is a configuration object used in Ray Train to define the experiment spec that corresponds to a call to trainer.fit().

It includes settings such as the experiment name, storage path for results, stopping conditions, custom callbacks, checkpoint configuration, verbosity level, and logging options.

Many of these settings are configured through dedicated config objects, such as CheckpointConfig for checkpointing and FailureConfig for fault tolerance, which you pass to the RunConfig. The following sub-sections describe these configs.

The properties of the run configuration are not tunable: you cannot include them in a Ray Tune search space.

from ray.train import RunConfig
from ray.air.integrations.wandb import WandbLoggerCallback

run_config = RunConfig(
    # Name of the training run (directory name).
    name="my_train_run",
    # The experiment results will be saved to: storage_path/name
    storage_path="~/ray_results",
    # storage_path="s3://my_bucket/train_results",
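    # Custom and built-in callbacks ("my_project" is a placeholder W&B project name;
    # the W&B callback needs a project, passed here or via the WANDB_PROJECT env var)
    callbacks=[WandbLoggerCallback(project="my_project")],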
    # Stopping criteria
    stop={"training_iteration": 10},
)
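The `stop` dictionary above is a threshold-based stopping criterion. As we understand Ray Tune's dict-based stopping, a run ends as soon as any listed metric reaches or exceeds its threshold, so these thresholds suit increasing metrics such as `training_iteration` or an accuracy. The sketch below models that rule with a hypothetical `should_stop` helper; it is an illustration of the semantics, not Ray's implementation, and its handling of a missing metric is an assumption.

```python
from typing import Mapping


def should_stop(result: Mapping[str, float], stop: Mapping[str, float]) -> bool:
    """Hypothetical model of dict-based stopping (not Ray's implementation).

    Returns True as soon as any metric named in `stop` has reached or
    exceeded its threshold in the latest reported `result`.
    """
    for metric, threshold in stop.items():
        if metric not in result:
            # Assumption: a stopping metric that is never reported is treated
            # as a configuration error instead of being silently ignored.
            raise KeyError(f"stopping metric {metric!r} not found in result")
        if result[metric] >= threshold:
            return True
    return False


stop = {"training_iteration": 10}
for iteration in range(1, 20):
    # Ray fills in training_iteration for each reported result; here we
    # simulate that counter ourselves.
    if should_stop({"training_iteration": iteration}, stop):
        print(f"stopping after iteration {iteration}")  # prints "stopping after iteration 10"
        break
```

With several entries in `stop`, meeting any single one is enough; for example, `{"training_iteration": 10, "mean_accuracy": 0.98}` would end a run at iteration 3 if accuracy had already reached 0.99.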

See also

See the RunConfig API reference.

See How to Configure Persistent Storage in Ray Tune for more examples of configuring storage_path.

Persistent storage

Ray Train saves results and checkpoints to a persistent storage location. By default, this is the local directory ~/ray_results.
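Combined with the run name, this gives the storage_path/name layout noted in the RunConfig example. The sketch below models that layout with a hypothetical `experiment_dir` helper (not a Ray function). It assumes a local path has `~` expanded and joins it with the OS separator, while a URI such as `s3://...` is joined with a forward slash; Ray's own path handling may differ in details.

```python
import os
import posixpath
from typing import Optional
from urllib.parse import urlparse

# Default location described above.
DEFAULT_STORAGE_PATH = "~/ray_results"


def experiment_dir(name: str, storage_path: Optional[str] = None) -> str:
    """Hypothetical helper: where results for the run `name` would be written."""
    base = storage_path or DEFAULT_STORAGE_PATH
    if urlparse(base).scheme:
        # Remote URI such as s3://bucket/prefix: join with forward slashes.
        return posixpath.join(base.rstrip("/"), name)
    # Local or shared-filesystem path: expand "~", then join normally.
    return os.path.join(os.path.expanduser(base), name)
```

For example, `experiment_dir("my_train_run", "s3://my_bucket/train_results")` returns `"s3://my_bucket/train_results/my_train_run"`, while `experiment_dir("my_train_run")` resolves to `my_train_run` inside your home directory's `ray_results` folder.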

This default is sufficient for single-node setups and for distributed training without fault tolerance. If you want to use fault tolerance, need access to shared data, or are training on spot instances, we recommend setting up a remote persistent storage location.

You define the persistent storage location by passing a storage_path to the RunConfig. This path can point to remote storage (for example, S3) or to a shared network filesystem, such as NFS.

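from ray.train import RunConfig

# Remote storage location (for example, an S3 bucket)
run_config = RunConfig(storage_path="s3://my_bucket/train_results")

# Shared network filesystem (for example, an NFS mount available on every node)
run_config = RunConfig(storage_path="/mnt/cluster_storage/train_results")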

When you configure a persistent storage path, make sure that every node in the cluster can read from and write to that location.
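For a shared-filesystem path such as an NFS mount, one way to check this is a small write probe run on each node before training. The sketch below is a hypothetical pre-flight check (`check_storage_writable` is our own helper, not a Ray utility). It applies only to filesystem paths, not to URIs like `s3://...`, and it assumes you run it yourself on every node.

```python
import os
import tempfile


def check_storage_writable(storage_path: str) -> bool:
    """Hypothetical pre-flight check: can this node write to storage_path?

    Creates the directory if needed, writes a small probe file, reads it
    back, and deletes it. Returns False on any OS-level error.
    """
    path = os.path.expanduser(storage_path)
    try:
        os.makedirs(path, exist_ok=True)
        # NamedTemporaryFile picks a unique name, so nodes probing the same
        # shared directory at the same time do not collide. The file is
        # deleted automatically when the `with` block exits.
        with tempfile.NamedTemporaryFile(mode="w+", dir=path, delete=True) as probe:
            probe.write("ok")
            probe.flush()
            probe.seek(0)
            return probe.read() == "ok"
    except OSError:
        return False
```

This only confirms that the current node can write locally to the path; it does not prove the directory is actually shared. A stronger check would have one node write a file and the other nodes read it back.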