[KubeRay][Docs] Simplify KubeRay upgrade steps and always upgrade CRDs first #44826

Merged: 1 commit merged into ray-project:master on Apr 26, 2024

Conversation

andrewsykim (Member) commented on Apr 18, 2024:

Why are these changes needed?

Simplifies the KubeRay upgrade steps and strongly recommends upgrading CRDs first. It also removes the recommendation to delete CRDs, since deleting CRDs deletes all RayCluster, RayJob, and RayService resources.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

```
# upgrade the CRD to v1.1.0
# NOTE: here we use kubectl because Helm does not support lifecycle management of CRDs
# See the Helm documentation for more details: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
$ kubectl apply --server-side --force-conflicts -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
```
andrewsykim (Member Author):

The --force-conflicts flag is needed here to resolve server-side conflicts with Helm:

```
$ kubectl apply --server-side -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
Apply failed with 2 conflicts: conflicts with "helm" using apiextensions.k8s.io/v1:
- .metadata.annotations.controller-gen.kubebuilder.io/version
- .spec.versions
Please review the fields above--they currently have other managers. Here
are the ways you can resolve this warning:
* If you intend to manage all of these fields, please re-run the apply
  command with the `--force-conflicts` flag.
* If you do not intend to manage all of the fields, please edit your
  manifest to remove references to the fields that should keep their
  current managers.
* You may co-own fields by updating your manifest to match the existing
  value; in this case, you'll become the manager if the other manager(s)
  stop managing the field (remove it from their configuration).
See https://kubernetes.io/docs/reference/using-api/server-side-apply/#conflicts
```

I think it's fine to force-resolve these conflicts because Helm doesn't manage CRDs after the initial install anyway.
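To see which manager owns the conflicting fields before forcing, the managed-field entries on the CRD can be listed directly. This is a sketch using the real `rayclusters.ray.io` CRD name and the standard `--show-managed-fields` flag; the exact output depends on the cluster's history:

```shell
# List the server-side apply field managers recorded on the RayCluster CRD.
# An entry with manager "helm" explains the conflicts reported above.
kubectl get crd rayclusters.ray.io --show-managed-fields \
  -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\n"}{end}'
```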

2. Upgrade the kuberay-operator image to the new version
3. Verify successful upgrade

Here's an example when upgrading KubeRay from v1.0.0 to v1.1.0:
andrewsykim (Member Author), Apr 18, 2024:

Should we still provide step-by-step example for the v0.6.0 -> v1.0.0 upgrade?

andrewsykim (Member Author):

Added an example for v0.6.0 -> v1.0.0 for consistency with the old docs

@kevin85421 kevin85421 self-assigned this Apr 19, 2024
kevin85421 (Member) left a comment:

I will test this manually.

Yicheng-Lu-llll (Contributor):

LGTM! I manually tested it, and everything works as expected.

kevin85421 (Member):

@andrewsykim I have already requested our doc team to review this PR.

angelinalg (Contributor) left a comment:

Just some nits.

```
# Install RayCluster v1.0.0 which uses CRD v1.
helm install raycluster kuberay/ray-cluster --version 1.0.0
```

Due to reliability and security implication of webhooks, KubeRay doesn't support a conversion webhook to convert v1alpha1 to v1 APIs.

Suggested change
Due to reliability and security implication of webhooks, KubeRay doesn't support a conversion webhook to convert v1alpha1 to v1 APIs.
Due to reliability and security implications of webhooks, KubeRay doesn't support a conversion webhook to convert v1alpha1 to v1 APIs.

Due to reliability and security implication of webhooks, KubeRay doesn't support a conversion webhook to convert v1alpha1 to v1 APIs.

To upgrade the KubeRay version, it's highly recommended to follow these steps in order:

Suggested change
To upgrade the KubeRay version, it's highly recommended to follow these steps in order:
To upgrade the KubeRay version, follow these steps in order:


To upgrade the KubeRay version, it's highly recommended to follow these steps in order:
1. Upgrade the CRD manifest, containing new fields added to the v1 CRDs.
2. Upgrade the kuberay-operator image to the new version

Suggested change
2. Upgrade the kuberay-operator image to the new version
2. Upgrade the kuberay-operator image to the new version.

To upgrade the KubeRay version, it's highly recommended to follow these steps in order:
1. Upgrade the CRD manifest, containing new fields added to the v1 CRDs.
2. Upgrade the kuberay-operator image to the new version
3. Verify successful upgrade

Suggested change
3. Verify successful upgrade
3. Verify the success of the upgrade.

2. Upgrade the kuberay-operator image to the new version
3. Verify successful upgrade

Here's an example when upgrading KubeRay from v1.0.0 to v1.1.0:

Suggested change
Here's an example when upgrading KubeRay from v1.0.0 to v1.1.0:
The following is an example of upgrading KubeRay from v1.0.0 to v1.1.0:

```
$ helm install raycluster kuberay/ray-cluster --version 1.1.0
```

Here's an example when upgrading KubeRay from v0.6.0 to v1.0.0:

Suggested change
Here's an example when upgrading KubeRay from v0.6.0 to v1.0.0:
The following is an example of upgrading KubeRay from v0.6.0 to v1.0.0:


Here's an example when upgrading KubeRay from v0.6.0 to v1.0.0:
```
# upgrade the CRD to v1.0.0
```

Suggested change
# upgrade the CRD to v1.0.0
# Upgrade the CRD to v1.0.0.

Here's an example when upgrading KubeRay from v0.6.0 to v1.0.0:
```
# upgrade the CRD to v1.0.0
# NOTE: here we use kubectl because Helm does not support lifecycle management of CRDs
```

Suggested change
# NOTE: here we use kubectl because Helm does not support lifecycle management of CRDs
# Note: This example uses kubectl because Helm doesn't support lifecycle management of CRDs.

# See the Helm documentation for more details: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
$ kubectl apply --server-side --force-conflicts -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.0.0"

# Upgrade kuberay-operator to v1.0.0. This step does not upgrade the CRDs.

Suggested change
# Upgrade kuberay-operator to v1.0.0. This step does not upgrade the CRDs.
# Upgrade kuberay-operator to v1.0.0. This step doesn't upgrade the CRDs.

# Upgrade kuberay-operator to v1.0.0. This step does not upgrade the CRDs.
$ helm upgrade kuberay-operator kuberay/kuberay-operator --version v1.0.0

# Install a RayCluster using the v1.0.0 helm chart to verify successful upgrade

Suggested change
# Install a RayCluster using the v1.0.0 helm chart to verify successful upgrade
# Install a RayCluster using the v1.0.0 helm chart to verify the success of the upgrade.

andrewsykim force-pushed the kuberay-api-docs branch 2 times, most recently from 27d4c96 to 0f6847e on April 24, 2024 01:44
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
andrewsykim (Member Author):

FYI, I updated the doc to use kubectl replace instead of kubectl apply after a discussion with Matyas and today's community meeting.

```
# Upgrade the CRD to v1.0.0.
# Note: This example uses kubectl because Helm doesn't support lifecycle management of CRDs.
# See the Helm documentation for more details: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
$ kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.0.0"
```
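A possible sanity check after the replace (a sketch; assumes a live cluster and that `rayclusters.ray.io` is the CRD of interest):

```shell
# Confirm the RayCluster CRD now serves the expected API versions (e.g. "v1").
kubectl get crd rayclusters.ray.io -o jsonpath='{.spec.versions[*].name}'

# Confirm existing RayCluster resources survived the in-place replace.
kubectl get rayclusters --all-namespaces
```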


FYI, this step causes disruptions for any running RayService resources until you apply the RayService config again. See this Slack thread for more details: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1701370178891199

andrewsykim (Member Author), Apr 24, 2024:

Is that specific to upgrading the CRD first, or are you referring to the entire v0.6.0 to v1.0.0 upgrade? As far as I'm aware, the v0.6.0 -> v1.0.0 upgrade is known to be disruptive and breaking changes like that are expected, but I'm not yet familiar enough with the RayService changes between v0.6.0 and v1.0.0 to know what's expected.

andrewsykim (Member Author):

From your comment here https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1701443938862159?thread_ts=1701370178.891199&cid=C02GFQ82JPM

it looks like upgrading the CRD first didn't cause any issues; it was the operator upgrade that was problematic. Maybe we need specific guidance for the v0.6.0 -> v1.0.0 upgrade based on known upgrade issues?


Ah yes, sorry: upgrading the operator causes these issues. I think the v0.6.0 -> v1.0.0 upgrade should have specific guidance.

(Member):

cc @andrewsykim do you plan to address this comment in this PR? If you want to do it in a follow-up PR, I am comfortable merging this PR.


If you're planning to add the v0.6.0 -> v1.0.0 guidance in a separate PR, I'd suggest changing the example here to show an upgrade from v1.0.0 -> v1.1.0, or something like that; otherwise it could be misleading.

(Member):

@smit-kiri are you interested in opening a PR to share your experience after this PR got merged?

andrewsykim (Member Author):

The old guidance suggests deleting the CRD, so I think what we have in this PR is still better.

I don't have enough experience with the v0.6.0 to v1.0.0 upgrade issues right now, so it'll have to be a follow-up PR if we add one.

smit-kiri:

@kevin85421 Yeah I can open a PR!

(Member):

Thanks, @smit-kiri!

```
# Upgrade the CRD to v1.1.0.
# Note: This example uses kubectl because Helm doesn't support lifecycle management of CRDs.
# See the Helm documentation for more details: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
$ kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
```
(Member):

What's the reason for using kubectl replace? Is it to avoid the size limit of kubectl.kubernetes.io/last-applied-configuration?

andrewsykim (Member Author):

Yes. Using kubectl apply --server-side also works around the size limit, but I think kubectl replace is fine here because we don't expect multiple writers to the CRD anyway, and the manifest in the release should be the source of truth, overwriting any previous updates from other writers.
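For background: client-side kubectl apply stores the full last-applied manifest in the kubectl.kubernetes.io/last-applied-configuration annotation, and Kubernetes caps total annotation size at 256 KiB, which large CRD manifests can exceed. A rough way to gauge the size on an existing CRD (a sketch; prints 0 if the CRD was never applied client-side):

```shell
# Measure how many bytes the client-side apply annotation occupies on the CRD.
# Oversized CRD manifests here are why plain `kubectl apply` can fail with
# an annotations-too-long validation error.
kubectl get crd rayclusters.ray.io \
  -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}' \
  | wc -c
```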

andrewsykim (Member Author):

With kubectl replace, we need to be careful not to use the --force flag, because that deletes the CRD and recreates it. But I think it's fine here as long as we don't recommend it.
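To make the distinction concrete, a sketch of the safe and unsafe variants (the --force line is deliberately commented out; deleting a CRD cascades to delete all of its custom resources):

```shell
# Safe: replaces the CRD objects in place; existing RayCluster, RayJob, and
# RayService custom resources are untouched.
kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"

# UNSAFE for upgrades: --force deletes and recreates the CRDs, which cascades
# to delete every RayCluster, RayJob, and RayService in the cluster.
# kubectl replace --force -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
```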

andrewsykim (Member Author), Apr 24, 2024:

Using kubectl apply --server-side --force-conflicts -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.0.0" also works well, so I'm fine with putting either one in the docs. kubectl replace seems a bit cleaner.

(Member):

Got it. Thank you for the explanation!

jjyao merged commit f6ba2e7 into ray-project:master on Apr 26, 2024.
7 participants