[kuberay] add guide for reducing image pull latency #49891
Conversation
LGTM!
I have already requested our doc team to review this PR.
LGTM 👍
### Preload images into machine images

Some cloud providers allow you to build custom machine images for your Kubernetes nodes. Including your Ray images in these custom machine images ensures that images are cached locally when your nodes start up, avoiding the need to pull them from a remote registry. While this approach can be effective, it is generally not recommended, as changing machine images often requires multiple steps and is tightly coupled to the lifecycle of your nodes.
nit: Do you think it would be good to expand on the pros and cons of images being cached locally? It kind of hints at why it isn't recommended, but it might be good to expand on the specific reasons why (for better clarity).
Some style nits. Please correct the subject for my suggestions where I converted passive to active voice.
# Reducing image pull latency on Kubernetes

This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic and can be used on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.
Suggested change:
- This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic and can be used on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.
+ This guide outlines strategies to reduce image pull latency for Ray clusters on Kubernetes. Some of these strategies are provider-agnostic so you can use them on any Kubernetes cluster, while others leverage capabilities specific to certain cloud providers.
## Image pull latency

Ray container images can often be large (several gigabytes), primarily due to the Python dependencies included.
Suggested change:
- Ray container images can often be large (several gigabytes), primarily due to the Python dependencies included.
+ Ray container images can often be several gigabytes, primarily due to the Python dependencies included.
## Strategies for reducing image pull latency

Here are some strategies you can use to reduce image pull latency:
Suggested change:
- Here are some strategies you can use to reduce image pull latency:
+ The following sections discuss strategies for reducing image pull latency.
### Preload images on every node using a Daemonset

You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach ensures that the image is downloaded to each node, reducing the time to pull the image when a pod needs to be scheduled.
Suggested change:
- You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach ensures that the image is downloaded to each node, reducing the time to pull the image when a pod needs to be scheduled.
+ You can ensure that your Ray images are always cached on every node by running a DaemonSet that pre-pulls the images. This approach ensures that Kubernetes downloads the image to each node, reducing the time to pull the image when Kubernetes needs to schedule a pod.
Here's an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:
Suggested change:
- Here's an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:
+ The following is an example DaemonSet configuration that uses the image `rayproject/ray:2.40.0`:
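The DaemonSet manifest itself isn't included in this excerpt of the review. A minimal sketch of such a pre-pull DaemonSet might look like the following; the `ray-image-preloader` name, the `sleep`-based placeholder command, and the resource values are illustrative, not taken from the PR:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ray-image-preloader
spec:
  selector:
    matchLabels:
      app: ray-image-preloader
  template:
    metadata:
      labels:
        app: ray-image-preloader
    spec:
      containers:
      - name: preloader
        # Pulling this image is the container's only job; the kubelet caches
        # it on every node so Ray pods scheduled later start faster.
        image: rayproject/ray:2.40.0
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
```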
Only container images hosted on [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview) are eligible for Image Streaming.

> **Note:** You might not notice the benefits of Image Streaming during the first pull of an eligible image. However, after Image Streaming caches the image, future image pulls on any cluster benefit from Image Streaming.
Suggested change:
- > **Note:** You might not notice the benefits of Image Streaming during the first pull of an eligible image. However, after Image Streaming caches the image, future image pulls on any cluster benefit from Image Streaming.
+ > **Note:** You might not notice the benefits of Image streaming during the first pull of an eligible image. However, after Image streaming caches the image, future image pulls on any cluster benefit from Image streaming.
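For reference, GKE enables Image streaming at the cluster or node pool level. A sketch of the cluster-level command, based on the GKE documentation (the cluster name is a placeholder):

```bash
# Enable Image streaming when creating a GKE cluster. It requires a
# containerd-based node image such as COS_CONTAINERD.
gcloud container clusters create my-ray-cluster \
    --image-type=COS_CONTAINERD \
    --enable-image-streaming
```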
### Enable secondary boot disks (GKE only)

If you're using Google Kubernetes Engine (GKE), you can enable [secondary bootdisk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading).
Suggested change:
- If you're using Google Kubernetes Engine (GKE), you can enable [secondary bootdisk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading).
+ If you're using Google Kubernetes Engine (GKE), you can enable the [secondary bootdisk to preload data or container images](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading).
Secondary boot disks are enabled per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool.
Suggested change:
- Secondary boot disks are enabled per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool.
+ GKE enables secondary boot disks per node pool. Once enabled, GKE attaches a Persistent Disk to each node within the node pool.
The images within the Persistent Disk are immediately accessible to containerd once workloads are scheduled on those nodes.
Suggested change:
- The images within the Persistent Disk are immediately accessible to containerd once workloads are scheduled on those nodes.
+ The images within the Persistent Disk are immediately accessible to containers once Kubernetes schedules workloads on those nodes.
Including Ray images in the secondary boot disk can significantly reduce image pull latency.
Refer to [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk and [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for information on how to enable secondary boot disks for your node pools.
Suggested change:
- Refer to [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk and [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for information on how to enable secondary boot disks for your node pools.
+ See [Prepare the secondary boot disk image](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#prepare) for detailed steps on how to prepare the secondary boot disk. See [Configure the secondary boot disk](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#configure) for how to enable secondary boot disks for your node pools.
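As a concrete illustration of the configuration step, creating a node pool with a secondary boot disk looks roughly like the following, based on the GKE documentation linked above; the cluster, node pool, project, and disk image names are placeholders:

```bash
# Attach a secondary boot disk, built from a disk image that contains the
# preloaded Ray container image, to every node in the new node pool.
gcloud container node-pools create ray-pool \
    --cluster=my-ray-cluster \
    --secondary-boot-disk=disk-image=projects/my-project/global/images/ray-2-40-0-disk,mode=CONTAINER_IMAGE
```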
### Preload images on every node using a Daemonset
Suggested change:
- ### Preload images on every node using a Daemonset
+ ### Preload images on every node using a DaemonSet
Capitalizing for consistency
Other factors can also contribute to image size. Pulling large images from remote repositories can significantly increase the startup time for your Ray clusters. The time required to download an image depends on several factors, including:
Suggested change:
- Other factors can also contribute to image size. Pulling large images from remote repositories can significantly increase the startup time for your Ray clusters. The time required to download an image depends on several factors, including:
+ Other factors can also contribute to image size. Pulling large images from remote repositories can slow down Ray cluster startup times. The time required to download an image depends on several factors, including:
Please consider using Vale to catch typos for future PRs: https://docs.ray.io/en/master/ray-contribute/docs.html#how-to-use-vale
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Thanks for the review, comments addressed
Why are these changes needed?
We are getting a lot of feedback from users that a guide for how to reduce image pull latency would be useful. This is the initial version of a guide that covers some strategies to reduce image pull latency.
Related issue number
ray-project/kuberay#2742
Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.