[kuberay] add guide for reducing image pull latency #49891


Merged

Conversation

andrewsykim (Member)

Why are these changes needed?

We're getting a lot of feedback from users that a guide on how to reduce image pull latency would be useful. This is the initial version of a guide that covers some strategies for reducing image pull latency.

Related issue number

ray-project/kuberay#2742

Checks

  • I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch from 1bba9fb to 4e06abe on January 16, 2025 21:27
@jcotant1 added the `core` label (Issues that should be addressed in Ray Core) on Jan 17, 2025
@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch 4 times, most recently from 48cd425 to 55a8638 on January 17, 2025 18:05
@kevin85421 self-assigned this on Jan 18, 2025
@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch from 55a8638 to 6fd5a2f on February 10, 2025 21:32

@chiayi (Contributor) left a comment:

LGTM!

@kevin85421 added the `go` label (add ONLY when ready to merge, run all tests) on Feb 11, 2025

@kevin85421 (Member):

I have already requested our doc team to review this PR.

@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch from 6fd5a2f to 59d82c6 on February 11, 2025 18:24

@cszhu (Collaborator) commented Feb 12, 2025:

LGTM 👍


### Preload images into machine images

Some cloud providers allow you to build custom machine images for your Kubernetes nodes. Including your Ray images in these custom machine images ensures that images are cached locally when your nodes start up, avoiding the need to pull them from a remote registry. While this approach can be effective, it is generally not recommended, as changing machine images often requires multiple steps and is tightly coupled to the lifecycle of your nodes.

Collaborator:

nit: Do you think it would be good to expand on the pros and cons of images being cached locally? It kind of hints on why it isn't recommended, but it might be good to expand on the specific reasons why (for better clarity).

@angelinalg (Contributor) left a comment:

Some style nits. Please correct the subject for my suggestions where I converted passive to active voice.


# Reducing image pull latency on Kubernetes

This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic and can be used on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.

Contributor:

Suggested change
This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic and can be used on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.
This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic so you can use them on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.


## Image pull latency

Ray container images can often be large (several gigabytes), primarily due to the Python dependencies included.

Contributor:

Suggested change
Ray container images can often be large (several gigabytes), primarily due to the Python dependencies included.
Ray container images can often be several gigabytes, primarily due to the Python dependencies included.


## Strategies for reducing image pulling latency

Here are some strategies you can use to reduce image pull latency:

Contributor:

Suggested change
Here are some strategies you can use to reduce image pull latency:
The following sections discuss strategies for reducing image pull latency.


### Preload images on every node using a Daemonset

You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach ensures that the image is downloaded to each node, reducing the time to pull the image when a pod needs to be scheduled.

Contributor:

Suggested change
You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach ensures that the image is downloaded to each node, reducing the time to pull the image when a pod needs to be scheduled.
You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach ensures that Ray downloads the image to each node, reducing the time to pull the image when a Ray needs to schedule a pod.


Here's an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:

Contributor:

Suggested change
Here's an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:
The following is an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:
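
For reference, the manifest itself isn't quoted in this thread. The following is only a minimal sketch of what such a pre-pull DaemonSet could look like, applied with a `kubectl` heredoc; the DaemonSet name, namespace, `sleep infinity` command, and resource requests are illustrative assumptions, not taken from the PR:

```bash
# Sketch: keep rayproject/ray:2.40.0 pulled and cached on every node.
# Names, namespace, and resource requests are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ray-image-prepuller
  namespace: default
spec:
  selector:
    matchLabels:
      app: ray-image-prepuller
  template:
    metadata:
      labels:
        app: ray-image-prepuller
    spec:
      containers:
      - name: prepull-ray
        image: rayproject/ray:2.40.0
        # Keep the pod running so the image stays cached on the node.
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: 10m
            memory: 32Mi
EOF
```

Because the tag is pinned (not `:latest`), the default `imagePullPolicy: IfNotPresent` lets Ray pods scheduled later reuse the cached copy instead of pulling it again. A common variant pulls the Ray image in an init container and runs a tiny `pause` image as the main container, so the pre-puller itself consumes almost no resources.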


Only container images hosted on [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview) are eligible for Image Streaming.

> **Note:** You might not notice the benefits of Image Streaming during the first pull of an eligible image. However, after Image Streaming caches the image, future image pulls on any cluster benefit from Image Streaming.

Contributor:

Suggested change
> **Note:** You might not notice the benefits of Image Streaming during the first pull of an eligible image. However, after Image Streaming caches the image, future image pulls on any cluster benefit from Image Streaming.
> **Note:** You might not notice the benefits of Image streaming during the first pull of an eligible image. However, after Image streaming caches the image, future image pulls on any cluster benefit from Image streaming.
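
As a quick illustration (the cluster name and region are placeholders, and the exact prerequisites live in the GKE documentation rather than this PR), Image Streaming is enabled at cluster or node-pool creation time with `gcloud`, for example:

```bash
# Sketch: create a GKE cluster with Image Streaming enabled.
# CLUSTER_NAME and the region are placeholders; see the GKE docs for prerequisites.
gcloud container clusters create CLUSTER_NAME \
    --region=us-central1 \
    --image-type="COS_CONTAINERD" \
    --enable-image-streaming
```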


### Enable secondary boot disks (GKE only)

If you're using Google Kubernetes Engine (GKE), you can enable [secondary bootdisk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading).

Contributor:

Suggested change
If you're using Google Kubernetes Engine (GKE), you can enable [secondary bootdisk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading).
If you're using Google Kubernetes Engine (GKE), you can enable the [secondary bootdisk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading).


Secondary boot disks are enabled per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool.

Contributor:

Suggested change
Secondary boot disks are enabled per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool.
GKE enables secondary boot disks per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool.

The images within the Persistent Disk are immediately accessible to containerd once workloads are scheduled on those nodes.

Contributor:

Suggested change
The images within the Persistent Disk are immediately accessible to containerd once workloads are scheduled on those nodes.
The images within the Persistent Disk are immediately accessible to containers once Ray schedules workloads on those nodes.

Including Ray images in the secondary boot disk can significantly reduce image pull latency.

Refer to [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk and [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for information on how to enable secondary boot disks for your node pools.

Contributor:

Suggested change
Refer to [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk and [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for information on how to enable secondary boot disks for your node pools.
See [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk. See [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for how to enable secondary boot disks for your node pools.
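
The following is only a rough sketch: the node pool, cluster, and disk-image names are placeholders, and the `--secondary-boot-disk` flag syntax should be verified against the GKE pages linked above. Attaching a prepared disk image as a secondary boot disk when creating a node pool looks roughly like this:

```bash
# Sketch: create a node pool whose nodes attach a secondary boot disk that
# already contains the Ray container images (prepared per the GKE guide).
# Verify the flag syntax against the linked GKE documentation.
gcloud container node-pools create ray-worker-pool \
    --cluster=CLUSTER_NAME \
    --secondary-boot-disk=disk-image=global/images/DISK_IMAGE_NAME,mode=CONTAINER_IMAGE_CACHE
```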


### Preload images on every node using a Daemonset

Collaborator:

Suggested change
### Preload images on every node using a Daemonset
### Preload images on every node using a DaemonSet

Capitalizing for consistency

Other factors can also contribute to image size. Pulling large images from remote repositories can significantly increase the startup time for your Ray clusters. The time required to download an image depends on several factors, including:

Collaborator:

Suggested change
Other factors can also contribute to image size. Pulling large images from remote repositories can significantly increase the startup time for your Ray clusters. The time required to download an image depends on several factors, including:
Other factors can also contribute to image size. Pulling large images from remote repositories can slow down Ray cluster startup times. The time required to download an image depends on several factors, including:

@angelinalg (Contributor):

Please consider using Vale to catch typos for future PRs: https://docs.ray.io/en/master/ray-contribute/docs.html#how-to-use-vale
Thanks for contributing to the docs!

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch from 59d82c6 to 6f76b2c on February 12, 2025 20:02

@andrewsykim (Member, Author):

Thanks for the review, comments addressed

@angelinalg merged commit 854ee39 into ray-project:master on Feb 12, 2025
5 checks passed
xsuler pushed a commit to antgroup/ant-ray that referenced this pull request on Mar 4, 2025
Labels: community-backlog, core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)

7 participants