[Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image #2087
Conversation
@@ -796,10 +796,15 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
}
}
// A replica can contain multiple hosts, so we need to calculate this based on the number of hosts per replica.
// If the user doesn't install the CRD with `NumOfHosts`, the zero value of `NumOfHosts`, which is 0, will be used.
// Hence, all workers will be deleted. Here, we set `NumOfHosts` to max(1, `NumOfHosts`) to avoid this situation.
if worker.NumOfHosts <= 0 {
In hindsight, `NumOfHosts` should have been a pointer so we can check for presence instead of relying on its zero value.
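For illustration only, here is a minimal sketch of what a pointer-based field and presence check could look like. The `WorkerGroupSpec` fragment and `numOfHostsOrDefault` helper below are hypothetical stand-ins for the example, not the actual KubeRay types:

```go
package main

import "fmt"

// Hypothetical fragment of a worker group spec. If NumOfHosts were a pointer,
// the reconciler could distinguish "field absent in the stored object" (nil)
// from an explicit value of 0.
type WorkerGroupSpec struct {
	NumOfHosts *int32 `json:"numOfHosts,omitempty"`
}

// numOfHostsOrDefault is a hypothetical helper: it falls back to 1 when the
// field was never set, e.g. the cluster was created with an older CRD that
// has no numOfHosts field.
func numOfHostsOrDefault(spec WorkerGroupSpec) int32 {
	if spec.NumOfHosts == nil {
		return 1
	}
	return *spec.NumOfHosts
}

func main() {
	old := WorkerGroupSpec{} // field never set, e.g. created with an older CRD
	n := int32(4)
	multiHost := WorkerGroupSpec{NumOfHosts: &n} // explicitly set multi-host group
	fmt.Println(numOfHostsOrDefault(old), numOfHostsOrDefault(multiHost)) // prints: 1 4
}
```

With a pointer, `nil` cleanly means "field absent from the stored object", so the operator can apply a default without conflating it with an explicit 0.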
Makes sense. At that time, I didn't take the mismatch between the CRD and the KubeRay image into consideration when reviewing the PR. Do you think we need to update the CRD in this PR? I think it will still be compatible with the YAML files for v1.1.0.
Hmm.. hard to say. It won't break any YAML but it would break Go references to the types.
I think we should also update the upgrade guidance to always upgrade the CRD before the binary, similar to Kubernetes cluster upgrades, where the apiserver (and its built-in types) should always be upgraded first. I can try to update the docs for this.
Thanks!
Confirmed this bug doesn't occur when I upgrade with these two steps:
$ kubectl apply --server-side --force-conflicts -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.1.0"
$ helm upgrade kuberay-operator kuberay/kuberay-operator --version v1.1.0
Why are these changes needed?
Some users only upgrade the KubeRay image without upgrading the CRD. For example, when a user upgrades the KubeRay operator from v1.0.0 to v1.1.0 without upgrading the CRD, the operator will use the zero value of `NumOfHosts` from the CRD, so all worker Pods will be deleted. This PR ensures that `NumOfHosts` is always greater than or equal to 1 (a simplified sketch of this guard is shown below).
Related issue number
Closes #2084
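For illustration, here is a minimal, self-contained sketch of the kind of guard described above. The `WorkerGroup` type, field names, and `expectedWorkerPods` helper are assumptions made for the example, not the actual `reconcilePods` code:

```go
package main

import "fmt"

// Hypothetical, simplified worker group for the example.
type WorkerGroup struct {
	Replicas   int32
	NumOfHosts int32 // zero when the stored object predates the NumOfHosts field
}

// expectedWorkerPods clamps NumOfHosts to at least 1 before computing the
// expected Pod count, so a cluster created with an old CRD (no numOfHosts
// field) does not make the operator conclude that zero worker Pods are desired.
func expectedWorkerPods(worker WorkerGroup) int32 {
	numOfHosts := worker.NumOfHosts
	if numOfHosts <= 0 {
		numOfHosts = 1
	}
	return worker.Replicas * numOfHosts
}

func main() {
	oldCRD := WorkerGroup{Replicas: 3}                // NumOfHosts missing -> zero value
	newCRD := WorkerGroup{Replicas: 3, NumOfHosts: 2} // multi-host group, explicitly set
	fmt.Println(expectedWorkerPods(oldCRD), expectedWorkerPods(newCRD)) // prints: 3 6
}
```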
Checks
Manually tested by building a `controller:latest` image and upgrading with `helm upgrade` (note that the CRD will not be updated).