Configuring Scale and GPUs
Increasing the scale of a Ray Train training run is simple and can often be done in a few lines of code.
The main interface for configuring scale and resources
is the ScalingConfig.
Scaling Configurations in Train (ScalingConfig)#
The scaling configuration specifies distributed training properties like the number of workers or the resources per worker.
The properties of the scaling configuration are tunable, so they can be searched over as hyperparameters with Ray Tune.
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    # Number of distributed workers.
    num_workers=2,
    # Turn on/off GPU.
    use_gpu=True,
    # Specify resources used for trainer.
    trainer_resources={"CPU": 1},
    # Try to schedule workers on different nodes.
    placement_strategy="SPREAD",
)
See also
See the ScalingConfig API reference.
Increasing the number of workers#
The main way to control parallelism in your training code is to set the
number of workers. To do so, pass the num_workers argument to
the ScalingConfig:
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=8,
)
Using GPUs#
To use GPUs, pass use_gpu=True to the ScalingConfig.
This will request one GPU per training worker. In the example below, training will
run on 8 GPUs (8 workers, each using one GPU).
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=8,
    # Request one GPU per worker.
    use_gpu=True,
)
More resources#
If you want to allocate more than one CPU or GPU per training worker, or if you
defined custom cluster resources, set
the resources_per_worker attribute:
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=8,
    # Each worker reserves 4 CPUs and 2 GPUs.
    resources_per_worker={
        "CPU": 4,
        "GPU": 2,
    },
    use_gpu=True,
)
Note that if you specify GPUs in resources_per_worker, you also need to set
use_gpu=True.
You can also instruct Ray Train to use fractional GPUs. In that case, multiple workers will be assigned the same CUDA device.
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=8,
    # Each worker reserves 4 CPUs and half a GPU.
    resources_per_worker={
        "CPU": 4,
        "GPU": 0.5,
    },
    use_gpu=True,
)
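The arithmetic behind the two resources_per_worker examples above can be sketched in plain Python. The helpers total_requested and workers_per_gpu are hypothetical names used only for illustration; they are not part of Ray, and they leave out the trainer's own resources. The sketch assumes that a fractional GPU request lets as many workers share a device as fit within one whole GPU.

```python
import math


def total_requested(num_workers, resources_per_worker):
    """Multiply each per-worker resource by the number of workers."""
    return {
        name: num_workers * amount
        for name, amount in resources_per_worker.items()
    }


def workers_per_gpu(gpu_fraction):
    """Number of workers that fit on one GPU for a given per-worker request.

    Hypothetical helper: assumes workers are packed onto a device until the
    next worker would exceed one whole GPU.
    """
    if gpu_fraction <= 0:
        raise ValueError("gpu_fraction must be positive")
    if gpu_fraction >= 1:
        return 1
    return math.floor(1 / gpu_fraction)


# 8 workers with 4 CPUs and 2 GPUs each.
whole = total_requested(8, {"CPU": 4, "GPU": 2})
# 8 workers with 4 CPUs and half a GPU each.
half = total_requested(8, {"CPU": 4, "GPU": 0.5})

print(whole)                 # {'CPU': 32, 'GPU': 16}
print(half)                  # {'CPU': 32, 'GPU': 4.0}
print(workers_per_gpu(0.5))  # 2
```

Under these assumptions, the fractional example needs 4 GPUs in total, with two workers sharing each device.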
Using GPUs in training code#
When use_gpu=True is set, Ray Train will automatically set up environment variables
in your training loop so that the GPUs can be detected and used
(e.g. CUDA_VISIBLE_DEVICES).
You can get the associated devices with ray.train.torch.get_device().
import torch

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device


def train_loop(config):
    # Ray Train has already made the assigned GPU visible to this worker.
    assert torch.cuda.is_available()

    device = get_device()
    # With a single GPU worker, the assigned device is the first visible one.
    assert device == torch.device("cuda:0")


trainer = TorchTrainer(
    train_loop,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
    ),
)
trainer.fit()
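To check which devices a worker has been given, the environment variable mentioned above can also be read directly. The following is a minimal sketch using only the standard library; visible_cuda_devices is a hypothetical helper name, and it relies only on the fact that CUDA_VISIBLE_DEVICES holds a comma-separated list of device identifiers.

```python
import os


def visible_cuda_devices(env=None):
    """Return the identifiers listed in CUDA_VISIBLE_DEVICES as strings.

    Hypothetical helper for illustration. `env` defaults to os.environ and
    can be replaced by any mapping, which makes the function easy to test.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # Split on commas and drop empty entries (unset or empty variable).
    return [part.strip() for part in raw.split(",") if part.strip()]


print(visible_cuda_devices({"CUDA_VISIBLE_DEVICES": "0,1"}))  # ['0', '1']
print(visible_cuda_devices({}))                               # []
```

Calling visible_cuda_devices() with no argument inside the training loop would report what that particular worker process sees.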
Trainer resources#
So far we’ve configured resources for each training worker. Technically, each
training worker is a Ray Actor. Ray Train also schedules
an actor for the Trainer object.
This object often only manages lightweight communication between the training workers. You can still specify its resources, which can be useful if you implemented your own Trainer that does heavier processing.
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=8,
    # Resources reserved for the Trainer actor itself.
    trainer_resources={
        "CPU": 4,
        "GPU": 1,
    },
)
By default, a trainer uses 1 CPU. If you have a cluster with 8 CPUs and want to start 4 training workers with 2 CPUs each, this will not work, because the total number of required CPUs is 9 (4 * 2 + 1). In that case, you can set the trainer resources to use 0 CPUs:
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=4,
    resources_per_worker={
        "CPU": 2,
    },
    # Do not reserve a CPU for the trainer.
    trainer_resources={
        "CPU": 0,
    },
)
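The CPU budget from the 8-CPU example above can be checked with a short calculation. This is a sketch in plain Python; total_cpus_required is a hypothetical helper, not a Ray function, and its default of 1 trainer CPU mirrors the default described above.

```python
def total_cpus_required(num_workers, cpus_per_worker, trainer_cpus=1):
    """CPUs needed for all training workers plus the trainer."""
    return num_workers * cpus_per_worker + trainer_cpus


cluster_cpus = 8

# Default trainer resources: 4 * 2 + 1 = 9 CPUs, which exceeds the cluster.
default_total = total_cpus_required(4, 2)
# Trainer set to 0 CPUs: 4 * 2 + 0 = 8 CPUs, which fits.
adjusted_total = total_cpus_required(4, 2, trainer_cpus=0)

print(default_total, default_total <= cluster_cpus)    # 9 False
print(adjusted_total, adjusted_total <= cluster_cpus)  # 8 True
```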