[KubeRay][Docs] Simplify KubeRay upgrade steps and always upgrade CRDs first #44826
Conversation
```
# upgrade the CRD to v1.1.0
# NOTE: here we use kubectl because Helm does not support lifecycle management of CRDs
# See the Helm documentation for more details: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
$ kubectl apply --server-side --force-conflicts -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
```
The `--force-conflicts` flag is needed here to resolve server-side conflicts with Helm:

```
$ kubectl apply --server-side -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
Apply failed with 2 conflicts: conflicts with "helm" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Please review the fields above--they currently have other managers. Here
are the ways you can resolve this warning:
* If you intend to manage all of these fields, please re-run the apply
  command with the `--force-conflicts` flag.
* If you do not intend to manage all of the fields, please edit your
  manifest to remove references to the fields that should keep their
  current managers.
* You may co-own fields by updating your manifest to match the existing
  value; in this case, you'll become the manager if the other manager(s)
  stop managing the field (remove it from their configuration).
See https://kubernetes.io/docs/reference/using-api/server-side-apply/#conflicts
```
I think it's fine to force-resolve these conflicts because Helm isn't managing the CRDs after the initial install anyway.
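To sanity-check that assumption, one hedged way to see which field managers currently own the CRD (assuming the default `rayclusters.ray.io` CRD name) is:

```
# List the field managers recorded on the RayCluster CRD; after the initial
# install, "helm" typically appears only as the manager from that first write.
$ kubectl get crd rayclusters.ray.io --show-managed-fields -o yaml | grep 'manager:'
```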
2. Upgrade the kuberay-operator image to the new version
3. Verify successful upgrade

Here's an example when upgrading KubeRay from v1.0.0 to v1.1.0:
Should we still provide a step-by-step example for the v0.6.0 -> v1.0.0 upgrade?
Added an example for v0.6.0 -> v1.0.0 for consistency with the old docs
I will test this manually.
LGTM! I manually tested it, and everything works as expected.
@andrewsykim I have already requested our doc team to review this PR.
Just some nits.
```
# Install RayCluster v1.0.0 which uses CRD v1.
helm install raycluster kuberay/ray-cluster --version 1.0.0
```
Due to reliability and security implication of webhooks, KubeRay doesn't support a conversion webhook to convert v1alpha1 to v1 APIs.
Suggested change:
```diff
- Due to reliability and security implication of webhooks, KubeRay doesn't support a conversion webhook to convert v1alpha1 to v1 APIs.
+ Due to reliability and security implications of webhooks, KubeRay doesn't support a conversion webhook to convert v1alpha1 to v1 APIs.
```
Due to reliability and security implication of webhooks, KubeRay doesn't support a conversion webhook to convert v1alpha1 to v1 APIs.

To upgrade the KubeRay version, it's highly recommended to follow these steps in order:
Suggested change:
```diff
- To upgrade the KubeRay version, it's highly recommended to follow these steps in order:
+ To upgrade the KubeRay version, follow these steps in order:
```
To upgrade the KubeRay version, it's highly recommended to follow these steps in order:
1. Upgrade the CRD manifest, containing new fields added to the v1 CRDs.
2. Upgrade the kuberay-operator image to the new version
Suggested change:
```diff
- 2. Upgrade the kuberay-operator image to the new version
+ 2. Upgrade the kuberay-operator image to the new version.
```
To upgrade the KubeRay version, it's highly recommended to follow these steps in order:
1. Upgrade the CRD manifest, containing new fields added to the v1 CRDs.
2. Upgrade the kuberay-operator image to the new version
3. Verify successful upgrade
Suggested change:
```diff
- 3. Verify successful upgrade
+ 3. Verify the success of the upgrade.
```
2. Upgrade the kuberay-operator image to the new version
3. Verify successful upgrade

Here's an example when upgrading KubeRay from v1.0.0 to v1.1.0:
Suggested change:
```diff
- Here's an example when upgrading KubeRay from v1.0.0 to v1.1.0:
+ The following is an example of upgrading KubeRay from v1.0.0 to v1.1.0:
```
```
$ helm install raycluster kuberay/ray-cluster --version 1.1.0
```

Here's an example when upgrading KubeRay from v0.6.0 to v1.0.0:
Suggested change:
```diff
- Here's an example when upgrading KubeRay from v0.6.0 to v1.0.0:
+ The following is an example of upgrading KubeRay from v0.6.0 to v1.0.0:
```
Here's an example when upgrading KubeRay from v0.6.0 to v1.0.0:
```
# upgrade the CRD to v1.0.0
```
Suggested change:
```diff
- # upgrade the CRD to v1.0.0
+ # Upgrade the CRD to v1.0.0.
```
Here's an example when upgrading KubeRay from v0.6.0 to v1.0.0:
```
# upgrade the CRD to v1.0.0
# NOTE: here we use kubectl because Helm does not support lifecycle management of CRDs
```
Suggested change:
```diff
- # NOTE: here we use kubectl because Helm does not support lifecycle management of CRDs
+ # Note: This example uses kubectl because Helm doesn't support lifecycle management of CRDs.
```
```
# See the Helm documentation for more details: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
$ kubectl apply --server-side --force-conflicts -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.0.0"

# Upgrade kuberay-operator to v1.0.0. This step does not upgrade the CRDs.
```
Suggested change:
```diff
- # Upgrade kuberay-operator to v1.0.0. This step does not upgrade the CRDs.
+ # Upgrade kuberay-operator to v1.0.0. This step doesn't upgrade the CRDs.
```
```
# Upgrade kuberay-operator to v1.0.0. This step does not upgrade the CRDs.
$ helm upgrade kuberay-operator kuberay/kuberay-operator --version v1.0.0

# Install a RayCluster using the v1.0.0 helm chart to verify successful upgrade
```
Suggested change:
```diff
- # Install a RayCluster using the v1.0.0 helm chart to verify successful upgrade
+ # Install a RayCluster using the v1.0.0 helm chart to verify the success of the upgrade.
```
FYI I updated the doc to use `kubectl replace`.
```
# Upgrade the CRD to v1.0.0.
# Note: This example uses kubectl because Helm doesn't support lifecycle management of CRDs.
# See the Helm documentation for more details: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
$ kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.0.0"
```
FYI this step causes disruptions for any running `RayService` resources until you apply the `RayService` config again. See this Slack thread for more details: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1701370178891199
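For anyone hitting this, a sketch of the recovery step being described (the manifest filename here is hypothetical):

```
# Re-apply your RayService manifest after the operator upgrade to restore
# the disrupted RayService resources.
$ kubectl apply -f my-rayservice.yaml
```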
Is that specific to upgrading the CRD first, or are you referring to the entire v0.6.0 to v1.0.0 upgrade? As far as I'm aware, the 0.6.0 -> 1.0.0 upgrade is known to be disruptive and breaking changes like that are expected, but I'm not familiar enough with the RayService changes between 0.6.0 and 1.0.0 yet to understand what is expected.
From your comment here https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1701443938862159?thread_ts=1701370178.891199&cid=C02GFQ82JPM it looks like upgrading the CRD first didn't cause any issues, but it was the operator upgrade that was problematic. Maybe we need specific guidance for 0.6.0 -> 1.0.0 based on known upgrade issues?
Ah yes, sorry: upgrading the operator causes these issues.
I think 0.6.0 -> 1.0.0 should have specific guidance.
cc @andrewsykim do you plan to address this comment in this PR? If you want to do it in a follow-up PR, I am comfortable merging this PR.
If you're planning to add the 0.6.0 -> 1.0.0 guidance in a separate PR, I'd suggest changing the example here to show an upgrade from 1.0.0 -> 1.1.0 or something like that; otherwise it could be misleading.
@smit-kiri are you interested in opening a PR to share your experience after this PR gets merged?
The old guidance suggests deleting the CRD, so I think what we have in this PR is still better.
I don't have enough experience with the 0.6.0 to 1.0.0 upgrade issues right now, so it'll have to be a follow-up PR if we add one.
@kevin85421 Yeah I can open a PR!
Thanks, @smit-kiri!
```
# Upgrade the CRD to v1.1.0.
# Note: This example uses kubectl because Helm doesn't support lifecycle management of CRDs.
# See the Helm documentation for more details: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
$ kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
```
What's the reason for using `kubectl replace`? Is it to avoid the size limit of `kubectl.kubernetes.io/last-applied-configuration`?
Yes. Using `kubectl apply --server-side` also works around the size limit, but I think `kubectl replace` here is fine because we don't expect multiple writers to the CRD anyway, and the manifest in the release should be the source of truth, overwriting any previous updates by other writers.
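For context, a sketch of the failure mode both commands work around; the `Too long` error shown in the comments below is the typical client-side apply failure for large CRDs, not output captured in this PR:

```
# Plain client-side apply stores the whole manifest in the
# kubectl.kubernetes.io/last-applied-configuration annotation, which can
# exceed the annotation size limit for large CRDs and fail with an error like:
#   metadata.annotations: Too long: must have at most 262144 bytes
$ kubectl apply -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"

# Either of these avoids writing that annotation:
$ kubectl apply --server-side -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
$ kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
```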
With `kubectl replace` we need to be careful not to use the `--force` flag, because that will delete the CRD and recreate it. But I think it's fine here as long as we don't recommend it.
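To make the distinction concrete, a hedged illustration (the second command is shown only as what to avoid):

```
# Safe: updates the existing CRD object in place; existing custom resources survive.
$ kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"

# Dangerous: --force deletes and recreates the CRD, and deleting a CRD
# cascades to delete every RayCluster, RayJob, and RayService on the cluster.
# $ kubectl replace --force -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
```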
Using `kubectl apply --server-side --force-conflicts -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.0.0"` also works well, so I'm fine with putting either one in the docs. `kubectl replace` seems a bit cleaner.
Got it. Thank you for the explanation!
Why are these changes needed?
Simplifies the KubeRay upgrade steps and strongly recommends upgrading CRDs first. Also removes the recommendation to delete CRDs, as doing so results in RayCluster, RayJob, and RayService resources being deleted.
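As a hedged sketch of the verification step (assuming the default `rayclusters.ray.io` CRD and `kuberay-operator` deployment names from the Helm charts):

```
# Check which API versions the RayCluster CRD now serves.
$ kubectl get crd rayclusters.ray.io -o jsonpath='{.spec.versions[*].name}'

# Check the operator image that is actually running.
$ kubectl get deployment kuberay-operator -o jsonpath='{.spec.template.spec.containers[0].image}'
```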
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.