ray.tune.syncer.SyncConfig
- class ray.tune.syncer.SyncConfig(upload_dir: Optional[str] = None, syncer: Optional[Union[str, ray.tune.syncer.Syncer]] = 'auto', sync_period: int = 300, sync_timeout: int = 1800, sync_artifacts: bool = True, sync_on_checkpoint: bool = True)
Bases: object
Configuration object for Tune syncing.
See Appendix: Types of data stored by Tune for an overview of what data is synchronized.
If a remote RunConfig(storage_path) is specified, both experiment and trial checkpoints will be stored on remote (cloud) storage. Synchronization then only happens via uploading/downloading from this remote storage.
There are a few scenarios where syncing takes place:
(1) The Tune driver (on the head node) syncing the experiment directory to the cloud (which includes experiment state such as searcher state, the list of trials and their statuses, and trial metadata)
(2) Workers directly syncing trial checkpoints to the cloud
(3) Workers syncing their trial directories to the head node (Deprecated)
(4) Workers syncing artifacts (which include all files saved in the trial directory except for checkpoints) directly to the cloud
Warning
When running on multiple nodes, using the local filesystem of the head node as the persistent storage location is deprecated. If you save trial checkpoints and run on a multi-node cluster, Tune will raise an error by default if NFS or cloud storage is not set up. See this issue for more information, including temporary workarounds as well as the deprecation and removal schedule.
See How to Configure Persistent Storage in Ray Tune for more details and examples.
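As a rough illustration of how this object might be constructed, the sketch below collects a few of the documented keyword arguments and wraps the construction in a helper. The values (60 s, 600 s) are arbitrary examples, the helper names make_sync_config and make_run_config are hypothetical, and importing RunConfig from ray.air with a sync_config argument is an assumption about the Ray release this page describes. The Ray imports are deferred so the snippet can be read and loaded without Ray installed.

```python
# Illustrative values only; the parameter names come from the class signature above.
sync_kwargs = {
    "sync_period": 60,       # sync at most once per minute (documented default: 300)
    "sync_timeout": 600,     # give up on a hanging sync after 10 minutes (default: 1800)
    "sync_artifacts": True,  # also upload non-checkpoint files from the trial directory
}


def make_sync_config(**kwargs):
    """Hypothetical helper: build a SyncConfig from keyword arguments."""
    # Import path as documented on this page; deferred so Ray is only
    # required when the helper is actually called.
    from ray.tune.syncer import SyncConfig

    return SyncConfig(**kwargs)


def make_run_config(storage_path, **kwargs):
    """Hypothetical helper: attach the SyncConfig to a RunConfig.

    Assumes RunConfig lives in ray.air and accepts sync_config,
    which may differ between Ray releases.
    """
    from ray.air import RunConfig

    return RunConfig(storage_path=storage_path, sync_config=make_sync_config(**kwargs))
```

With Ray installed, make_run_config("s3://my-bucket/tune-results", **sync_kwargs) would be one way to use it, where the bucket path is a placeholder.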
- Parameters
upload_dir – This config is deprecated in favor of RunConfig(storage_path).
syncer – If a cloud storage_path is configured, then this config accepts a custom syncer subclassing Syncer which will be used to synchronize checkpoints to/from cloud storage. Defaults to "auto" (auto detect), which defaults to use pyarrow.fs.
sync_period – Minimum time in seconds to wait between two sync operations. A smaller sync_period will have more up-to-date data at the sync location but introduces more syncing overhead. Defaults to 5 minutes. Note: This applies to (1) and (3). Trial checkpoints are uploaded to the cloud synchronously on every checkpoint.
sync_timeout – Maximum time in seconds to wait for a sync process to finish running. This is used to catch hanging sync operations so that experiment execution can continue and the syncs can be retried. Defaults to 30 minutes. Note: Currently, this timeout only affects cloud syncing: (1) and (2).
sync_artifacts – Whether or not to sync artifacts that are saved to the trial directory (accessed via session.get_trial_dir()) to the cloud. Artifact syncing happens at the same frequency as trial checkpoint syncing. Note: This is scenario (4).
sync_on_checkpoint – This config is deprecated. If True, a sync from a worker's remote trial directory to the head node will be forced on every trial checkpoint, regardless of the sync_period. Defaults to True. Note: This is ignored if upload_dir is specified, since this only applies to worker-to-head-node syncing (3).
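To make the "minimum time between two sync operations" semantics of sync_period concrete, here is a minimal, self-contained sketch of such a throttle, including a force flag that bypasses the period in the way sync_on_checkpoint is described as doing. The SyncGate class is hypothetical and is not Ray's implementation; it only models the timing rule described above.

```python
from typing import Optional


class SyncGate:
    """Hypothetical helper that decides whether a sync may run now."""

    def __init__(self, sync_period: float = 300.0) -> None:
        self.sync_period = sync_period
        self.last_sync_time: Optional[float] = None

    def should_sync(self, now: float, force: bool = False) -> bool:
        # A forced sync (for example, on checkpoint) ignores the minimum period,
        # and the very first sync is always allowed.
        if force or self.last_sync_time is None:
            return True
        return now - self.last_sync_time >= self.sync_period

    def sync_if_due(self, now: float, force: bool = False) -> bool:
        # Record the sync time only when a sync is actually allowed.
        if not self.should_sync(now, force):
            return False
        self.last_sync_time = now
        return True


gate = SyncGate(sync_period=300)
# Sync requests arriving at these timestamps (in seconds).
results = [gate.sync_if_due(t) for t in (0, 100, 299, 300, 450, 600)]
print(results)  # [True, False, False, True, False, True]
```

Requests that arrive less than sync_period seconds after the last completed sync are skipped, which is why a smaller period gives fresher data at the cost of more sync operations.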
PublicAPI: This API is stable across Ray releases.
Methods
validate_upload_dir([upload_dir]) – Checks if upload_dir is supported by syncer.
Attributes