[Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image #2087

Merged

@kevin85421 kevin85421 commented Apr 17, 2024

Why are these changes needed?

Some users upgrade only the KubeRay operator image without upgrading the CRD. For example, when a user upgrades the KubeRay operator from v1.0.0 to v1.1.0 but keeps the v1.0.0 CRD, that CRD has no NumOfHosts field, so the operator sees the zero value of NumOfHosts, which is 0. As a result, all worker Pods are deleted. This PR ensures that NumOfHosts is always treated as greater than or equal to 1.
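The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the operator's actual code: `workerGroupSpec`, `effectiveNumOfHosts`, and `desiredWorkerPods` are hypothetical names, and the struct is a trimmed-down stand-in that assumes the desired Pod count is the replica count multiplied by the hosts per replica.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workerGroupSpec is a hypothetical, trimmed-down stand-in for the real
// worker group type; only the two fields relevant to this bug are kept.
type workerGroupSpec struct {
	Replicas   int32 `json:"replicas"`
	NumOfHosts int32 `json:"numOfHosts,omitempty"`
}

// effectiveNumOfHosts returns max(1, n), so a missing or non-positive
// value is treated as one host per replica.
func effectiveNumOfHosts(n int32) int32 {
	if n <= 0 {
		return 1
	}
	return n
}

// desiredWorkerPods assumes each replica consists of NumOfHosts Pods.
func desiredWorkerPods(spec workerGroupSpec) int32 {
	return spec.Replicas * effectiveNumOfHosts(spec.NumOfHosts)
}

func main() {
	// An object stored under the older CRD carries no numOfHosts value.
	raw := []byte(`{"replicas": 3}`)

	var spec workerGroupSpec
	if err := json.Unmarshal(raw, &spec); err != nil {
		panic(err)
	}

	// The absent field decodes to Go's zero value for int32.
	fmt.Println("decoded NumOfHosts:", spec.NumOfHosts)
	// Without the guard, the desired Pod count collapses to zero.
	fmt.Println("unguarded Pod count:", spec.Replicas*spec.NumOfHosts)
	// With the guard, the replica count is preserved.
	fmt.Println("guarded Pod count:", desiredWorkerPods(spec))
}
```

Clamping at read time keeps the fix inside the operator, so it works regardless of which CRD version is installed in the cluster.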

Related issue number

Closes #2084

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
  • Step 1: Install a KubeRay operator via KubeRay v1.0.0 Helm chart.
  • Step 2: Install a RayCluster via KubeRay v1.0.0 Helm chart.
  • Step 3: Build a KubeRay operator image (controller:latest)
  • Step 4: Upgrade KubeRay operator via helm upgrade (note that the CRD will not be updated)
    helm upgrade --install kuberay-operator kuberay/kuberay-operator --version 1.1.0 --set image.repository=controller,image.tag=latest
  • Step 5: Verify
    • Without this PR, the RayCluster's worker Pod will be deleted.
    • With this PR, the RayCluster's worker Pod will still be there.

@@ -796,10 +796,15 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
}
}
// A replica can contain multiple hosts, so we need to calculate this based on the number of hosts per replica.
// If the user doesn't install the CRD with `NumOfHosts`, the zero value of `NumOfHosts`, which is 0, will be used.
// Hence, all workers will be deleted. Here, we set `NumOfHosts` to max(1, `NumOfHosts`) to avoid this situation.
if worker.NumOfHosts <= 0 {
Member

In hindsight, NumOfHosts should have been a pointer so we can check for presence instead of using its zero value.
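To illustrate the pointer suggestion, here is a small sketch under stated assumptions: `workerGroupSpecPtr` and `describeNumOfHosts` are hypothetical names, and this is not the type the project ships. It only shows how a pointer field lets the code tell "absent" apart from "explicitly 0".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workerGroupSpecPtr is a hypothetical variant of the spec in which
// NumOfHosts is a pointer, so an absent field decodes to nil.
type workerGroupSpecPtr struct {
	NumOfHosts *int32 `json:"numOfHosts,omitempty"`
}

// describeNumOfHosts reports whether the field was present in the input
// and, if so, which value it carried.
func describeNumOfHosts(raw string) (string, error) {
	var spec workerGroupSpecPtr
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		return "", err
	}
	if spec.NumOfHosts == nil {
		return "absent", nil
	}
	return fmt.Sprintf("set to %d", *spec.NumOfHosts), nil
}

func main() {
	inputs := []string{`{}`, `{"numOfHosts": 0}`, `{"numOfHosts": 2}`}
	for _, raw := range inputs {
		desc, err := describeNumOfHosts(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", raw, desc)
	}
}
```

With a plain `int32` field, the first two inputs would be indistinguishable after decoding, which is why the merged fix has to fall back on a `<= 0` check.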

Member Author

Makes sense. At that time, I didn't take the mismatch between the CRD and the KubeRay image into consideration when reviewing the PR. Do you think we need to update the CRD in this PR? I think it will still be compatible with the YAML files for v1.1.0.

Member

@andrewsykim andrewsykim Apr 17, 2024

Hmm.. hard to say. It won't break any YAML but it would break Go references to the types.

I think we should also update the upgrade guidance to always upgrade the CRD before the binary. This is similar to Kubernetes cluster upgrades, where the apiserver (and its built-in types) should always be upgraded first. I can try to update the docs for this.

Member Author

Thanks!

Member

Confirmed this bug doesn't occur when I upgrade with these two steps:

$ kubectl apply --server-side --force-conflicts -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
$ helm upgrade kuberay-operator kuberay/kuberay-operator --version v1.1.0

@kevin85421 kevin85421 marked this pull request as ready for review April 17, 2024 21:31
@kevin85421 kevin85421 requested a review from jjyao April 17, 2024 21:31
@kevin85421 kevin85421 merged commit 20636f9 into ray-project:master Apr 18, 2024
@kevin85421 kevin85421 assigned kevin85421 and unassigned jjyao Apr 18, 2024
kevin85421 added a commit to kevin85421/kuberay that referenced this pull request May 6, 2024
kevin85421 added a commit to kevin85421/kuberay that referenced this pull request May 6, 2024
kevin85421 added a commit that referenced this pull request May 6, 2024
Successfully merging this pull request may close these issues.

[Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image
3 participants