Best practices in production
This section helps you:
Understand best practices when operating Serve in production
Learn more about managing Serve with the Serve CLI
Configure your HTTP requests when querying Serve
CLI best practices#
This section summarizes the best practices for deploying to production using the Serve CLI:
Use serve run to manually test and improve your Serve application locally.
Use serve build to create a Serve config file for your Serve application.
For development, put your Serve application’s code in a remote repository and manually configure the working_dir or py_modules fields in your Serve config file’s runtime_env to point to that repository.
For production, put your Serve application’s code in a custom Docker image instead of a runtime_env. See this tutorial to learn how to create custom Docker images and deploy them on KubeRay.
Use serve status to track your Serve application’s health and deployment progress.
Use serve config to check the latest config that your Serve application received. This is its goal state.
Make lightweight configuration updates (e.g., num_replicas or user_config changes) by modifying your Serve config file and redeploying it with serve deploy.
Inspect an application with serve config and serve status#
Two Serve CLI commands help you inspect a Serve application in production: serve config and serve status.
If you have a remote cluster, serve config and serve status also accept an --address/-a argument to access the cluster. See VM deployment for more information on this argument.
serve config gets the latest config file that the Ray Cluster received. This config file represents the Serve application’s goal state. The Ray Cluster constantly strives to reach and maintain this state by deploying deployments, recovering failed replicas, and performing other relevant actions.
Using the fruit_config.yaml example from an earlier section:
$ ray start --head
$ serve deploy fruit_config.yaml
...
$ serve config
import_path: fruit:deployment_graph
runtime_env: {}
deployments:
- name: MangoStand
  num_replicas: 2
  route_prefix: null
...
serve status gets your Serve application’s current status. The status has two parts per application: the app_status and the deployment_statuses.
The app_status contains three fields:
status: A Serve application has four possible statuses:
    "NOT_STARTED": No application has been deployed on this cluster.
    "DEPLOYING": The application is currently carrying out a serve deploy request. It is deploying new deployments or updating existing ones.
    "RUNNING": The application is at steady-state. It has finished executing any previous serve deploy requests, and is attempting to maintain the goal state set by the latest serve deploy request.
    "DEPLOY_FAILED": The latest serve deploy request has failed.
message: Provides context on the current status.
deployment_timestamp: A UNIX timestamp of when Serve received the last serve deploy request. The timestamp is calculated using the ServeController’s local clock.
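Because deployment_timestamp is a raw UNIX timestamp, it can help to convert it to a readable time when comparing it against logs. The following is a minimal sketch using only the Python standard library; format_deployment_timestamp is a hypothetical helper name, not part of Serve, and rendering the time in UTC is an assumption made for illustration.

```python
from datetime import datetime, timezone


def format_deployment_timestamp(ts: float) -> str:
    """Render a UNIX timestamp (seconds since the epoch) as an ISO 8601 UTC string."""
    # Hypothetical helper: sub-second precision is dropped for readability.
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(timespec="seconds")
```

Keep in mind that the value reflects the ServeController’s local clock, so small offsets from other machines’ clocks are possible.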
The deployment_statuses contains a list of dictionaries representing each deployment’s status. Each dictionary has three fields:
name: The deployment’s name.
status: A Serve deployment has three possible statuses:
    "UPDATING": The deployment is updating to meet the goal state set by a previous deploy request.
    "HEALTHY": The deployment achieved the latest request’s goal state.
    "UNHEALTHY": The deployment has either failed to update, or has updated and has become unhealthy afterwards. This condition may be due to an error in the deployment’s constructor, a crashed replica, or a general system or machine error.
message: Provides context on the current status.
Use the serve status command to inspect your deployments after they are deployed and throughout their lifetime.
Using the fruit_config.yaml example from an earlier section:
$ ray start --head
$ serve deploy fruit_config.yaml
...
$ serve status
app_status:
  status: RUNNING
  message: ''
  deployment_timestamp: 1655771534.835145
deployment_statuses:
- name: MangoStand
  status: HEALTHY
  message: ''
- name: OrangeStand
  status: HEALTHY
  message: ''
- name: PearStand
  status: HEALTHY
  message: ''
- name: FruitMarket
  status: HEALTHY
  message: ''
- name: DAGDriver
  status: HEALTHY
  message: ''
For Kubernetes deployments with KubeRay, tighter integrations of serve status with Kubernetes are available. See Getting the status of Serve applications in Kubernetes.
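If you want to check this status from a script, one option is to parse the serve status output into a dictionary (for example with a YAML parser) and evaluate the fields described above. The following is a minimal sketch under that assumption. The helper names not_healthy_deployments and is_serving are hypothetical and not part of Serve, and the example dictionary is a shortened copy of the output shown earlier.

```python
from typing import Any, Dict, List


def not_healthy_deployments(status: Dict[str, Any]) -> List[str]:
    """Return the names of deployments whose status is anything but HEALTHY."""
    return [
        deployment["name"]
        for deployment in status.get("deployment_statuses", [])
        if deployment.get("status") != "HEALTHY"
    ]


def is_serving(status: Dict[str, Any]) -> bool:
    """True only if the app is RUNNING and every listed deployment is HEALTHY."""
    app_running = status.get("app_status", {}).get("status") == "RUNNING"
    return app_running and not not_healthy_deployments(status)


# Assumed input: the parsed form of the serve status output (shortened here).
example_status = {
    "app_status": {"status": "RUNNING", "message": "", "deployment_timestamp": 1655771534.835145},
    "deployment_statuses": [
        {"name": "MangoStand", "status": "HEALTHY", "message": ""},
        {"name": "DAGDriver", "status": "HEALTHY", "message": ""},
    ],
}
print(is_serving(example_status))
```

This sketch treats UPDATING the same as UNHEALTHY, which is a deliberately conservative choice. A deployment that is still rolling out may be perfectly fine, so a real check might wait and poll again rather than fail immediately.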
Best practices for HTTP requests#
Most examples in these docs use straightforward get or post requests using Python’s requests library, such as:
import requests
response = requests.get("http://localhost:8000/")
result = response.text
This pattern is useful for prototyping, but it isn’t sufficient for production. In production, HTTP requests should use:
Retries: Requests may occasionally fail due to transient issues (e.g., slow network, node failure, power outage, or a spike in traffic). Retry failed requests a handful of times to account for these issues.
Exponential backoff: To avoid bombarding the Serve application with retries during a transient error, apply an exponential backoff on failure. Each retry should wait exponentially longer than the previous one before running. For example, the first retry may wait 0.1s after a failure, and subsequent retries may wait 0.4s (4 x 0.1), 1.6s, 6.4s, 25.6s, and so on after each further failure.
Timeouts: Add a timeout to each retry to prevent requests from hanging. The timeout should be longer than the application’s latency to give your application enough time to process requests. Additionally, set an end-to-end timeout in the Serve application, so slow requests don’t bottleneck replicas.
import requests
from requests.adapters import HTTPAdapter, Retry

session = requests.Session()

retries = Retry(
    total=5,  # 5 retries total
    backoff_factor=1,  # Exponential backoff
    status_forcelist=[  # Retry on server errors
        500,
        501,
        502,
        503,
        504,
    ],
)

session.mount("http://", HTTPAdapter(max_retries=retries))

response = session.get("http://localhost:8000/", timeout=10)  # Add timeout
result = response.text
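To make the backoff arithmetic concrete, the same three practices can also be sketched as a small hand-rolled loop. This is an illustration only, not part of Serve or requests: backoff_delays and call_with_backoff are hypothetical helper names, and the defaults (a 0.1s base delay and a factor of 4) simply mirror the example schedule described above. The function passed in is assumed to enforce its own per-attempt timeout.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base_delay: float = 0.1, factor: float = 4.0) -> List[float]:
    """Return the wait before each retry: base_delay, base_delay * factor, ..."""
    return [base_delay * factor**attempt for attempt in range(retries)]


def call_with_backoff(
    func: Callable[[], T],
    retries: int = 5,
    base_delay: float = 0.1,
    factor: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func once, then retry up to `retries` times with growing waits.

    func should enforce its own per-attempt timeout, for example by passing
    timeout=... to requests.get.
    """
    for delay in backoff_delays(retries, base_delay, factor):
        try:
            return func()
        except Exception:
            # Assumption for brevity: every exception is treated as transient.
            # A real client would likely retry only on specific error types.
            sleep(delay)
    return func()  # Final attempt; any exception propagates to the caller.
```

With requests installed, a call might look like call_with_backoff(lambda: requests.get("http://localhost:8000/", timeout=10)). Note that urllib3’s Retry derives its own delay schedule from backoff_factor, so the adapter-based example won’t necessarily wait for the same intervals as this sketch.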