[kuberay] add guide for reducing image pull latency #49891


Merged

Conversation

andrewsykim (Member)

Why are these changes needed?

We're getting a lot of feedback from users that a guide on how to reduce image pull latency would be useful. This is the initial version of a guide that covers some strategies for reducing image pull latency.

Related issue number

ray-project/kuberay#2742

Checks

  • I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch from 1bba9fb to 4e06abe on January 16, 2025 21:27
@jcotant1 added the `core` label (Issues that should be addressed in Ray Core) on Jan 17, 2025
@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch 4 times, most recently from 48cd425 to 55a8638 on January 17, 2025 18:05
@kevin85421 self-assigned this on Jan 18, 2025
@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch from 55a8638 to 6fd5a2f on February 10, 2025 21:32

@chiayi (Contributor) left a comment:

LGTM!

@kevin85421 added the `go` label (add ONLY when ready to merge, run all tests) on Feb 11, 2025

@kevin85421 (Member):

I have already requested our doc team to review this PR.

@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch from 6fd5a2f to 59d82c6 on February 11, 2025 18:24

@cszhu (Collaborator) commented Feb 12, 2025:

LGTM 👍


### Preload images into machine images

Some cloud providers allow you to build custom machine images for your Kubernetes nodes. Including your Ray images in these custom machine images ensures that images are cached locally when your nodes start up, avoiding the need to pull them from a remote registry. While this approach can be effective, it is generally not recommended, as changing machine images often requires multiple steps and is tightly coupled to the lifecycle of your nodes.

Collaborator:

nit: Do you think it would be good to expand on the pros and cons of images being cached locally? It kind of hints on why it isn't recommended, but it might be good to expand on the specific reasons why (for better clarity).

@angelinalg (Contributor) left a comment:

Some style nits. Please correct the subject for my suggestions where I converted passive to active voice.


# Reducing image pull latency on Kubernetes

This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic and can be used on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.

Contributor:

Suggested change
This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic and can be used on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.
This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic so you can use them on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.


## Image pull latency

Ray container images can often be large (several gigabytes), primarily due to the Python dependencies included.

Contributor:

Suggested change
Ray container images can often be large (several gigabytes), primarily due to the Python dependencies included.
Ray container images can often be several gigabytes, primarily due to the Python dependencies included.


## Strategies for reducing image pulling latency

Here are some strategies you can use to reduce image pull latency:

Contributor:

Suggested change
Here are some strategies you can use to reduce image pull latency:
The following sections discuss strategies for reducing image pull latency.


### Preload images on every node using a Daemonset

You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach ensures that the image is downloaded to each node, reducing the time to pull the image when a pod needs to be scheduled.

Contributor:

Suggested change
You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach ensures that the image is downloaded to each node, reducing the time to pull the image when a pod needs to be scheduled.
You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach ensures that Ray downloads the image to each node, reducing the time to pull the image when a Ray needs to schedule a pod.


Here's an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:

Contributor:

Suggested change
Here's an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:
The following is an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:
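
For reference, the manifest itself isn't quoted in this thread. The following is only a minimal sketch of what such a pre-pull DaemonSet could look like, applied with a `kubectl` heredoc; the DaemonSet name, namespace, `sleep infinity` command, and resource requests are illustrative assumptions, not taken from the PR:

```bash
# Sketch: keep rayproject/ray:2.40.0 pulled and cached on every node.
# Names, namespace, and resource requests are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ray-image-prepuller
  namespace: default
spec:
  selector:
    matchLabels:
      app: ray-image-prepuller
  template:
    metadata:
      labels:
        app: ray-image-prepuller
    spec:
      containers:
      - name: prepull-ray
        image: rayproject/ray:2.40.0
        # Keep the pod running so the image stays cached on the node.
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: 10m
            memory: 32Mi
EOF
```

Because the tag is pinned (not `:latest`), the default `imagePullPolicy: IfNotPresent` lets Ray pods scheduled later reuse the cached copy instead of pulling it again. A common variant pulls the Ray image in an init container and runs a tiny `pause` image as the main container, so the pre-puller itself consumes almost no resources.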


Only container images hosted on [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview) are eligible for Image Streaming.

> **Note:** You might not notice the benefits of Image Streaming during the first pull of an eligible image. However, after Image Streaming caches the image, future image pulls on any cluster benefit from Image Streaming.

Contributor:

Suggested change
> **Note:** You might not notice the benefits of Image Streaming during the first pull of an eligible image. However, after Image Streaming caches the image, future image pulls on any cluster benefit from Image Streaming.
> **Note:** You might not notice the benefits of Image streaming during the first pull of an eligible image. However, after Image streaming caches the image, future image pulls on any cluster benefit from Image streaming.
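
As a quick illustration (the cluster name and region are placeholders, and the exact prerequisites live in the GKE documentation rather than this PR), Image Streaming is enabled at cluster or node-pool creation time with `gcloud`, for example:

```bash
# Sketch: create a GKE cluster with Image Streaming enabled.
# CLUSTER_NAME and the region are placeholders; see the GKE docs for prerequisites.
gcloud container clusters create CLUSTER_NAME \
    --region=us-central1 \
    --image-type="COS_CONTAINERD" \
    --enable-image-streaming
```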


### Enable secondary boot disks (GKE only)

If you're using Google Kubernetes Engine (GKE), you can enable [secondary bootdisk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading).

Contributor:

Suggested change
If you're using Google Kubernetes Engine (GKE), you can enable [secondary bootdisk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading).
If you're using Google Kubernetes Engine (GKE), you can enable the [secondary bootdisk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading).


Secondary boot disks are enabled per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool.

Contributor:

Suggested change
Secondary boot disks are enabled per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool.
GKE enables secondary boot disks per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool.

The images within the Persistent Disk are immediately accessible to containerd once workloads are scheduled on those nodes.

Contributor:

Suggested change
The images within the Persistent Disk are immediately accessible to containerd once workloads are scheduled on those nodes.
The images within the Persistent Disk are immediately accessible to containers once Ray schedules workloads on those nodes.

Including Ray images in the secondary boot disk can significantly reduce image pull latency.

Refer to [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk and [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for information on how to enable secondary boot disks for your node pools.

Contributor:

Suggested change
Refer to [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk and [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for information on how to enable secondary boot disks for your node pools.
See [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk. See [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for how to enable secondary boot disks for your node pools.
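
The following is only a rough sketch: the node pool, cluster, and disk-image names are placeholders, and the `--secondary-boot-disk` flag syntax should be verified against the GKE pages linked above. Attaching a prepared disk image as a secondary boot disk when creating a node pool looks roughly like this:

```bash
# Sketch: create a node pool whose nodes attach a secondary boot disk that
# already contains the Ray container images (prepared per the GKE guide).
# Verify the flag syntax against the linked GKE documentation.
gcloud container node-pools create ray-worker-pool \
    --cluster=CLUSTER_NAME \
    --secondary-boot-disk=disk-image=global/images/DISK_IMAGE_NAME,mode=CONTAINER_IMAGE_CACHE
```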


### Preload images on every node using a Daemonset

Collaborator:

Suggested change
### Preload images on every node using a Daemonset
### Preload images on every node using a DaemonSet

Capitalizing for consistency

Other factors can also contribute to image size. Pulling large images from remote repositories can significantly increase the startup time for your Ray clusters. The time required to download an image depends on several factors, including:

Collaborator:

Suggested change
Other factors can also contribute to image size. Pulling large images from remote repositories can significantly increase the startup time for your Ray clusters. The time required to download an image depends on several factors, including:
Other factors can also contribute to image size. Pulling large images from remote repositories can slow down Ray cluster startup times. The time required to download an image depends on several factors, including:

@angelinalg (Contributor):

Please consider using Vale to catch typos for future PRs: https://docs.ray.io/en/master/ray-contribute/docs.html#how-to-use-vale
Thanks for contributing to the docs!

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@andrewsykim force-pushed the optimize-ray-cluster-start-up-guide branch from 59d82c6 to 6f76b2c on February 12, 2025 20:02

@andrewsykim (Member, Author):

Thanks for the review, comments addressed

@angelinalg merged commit 854ee39 into ray-project:master on Feb 12, 2025
5 checks passed
xsuler pushed a commit to antgroup/ant-ray that referenced this pull request on Mar 4, 2025
Labels: community-backlog, core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)

7 participants