Search before asking
- I have searched the issues and found no similar feature request.
Description
The RayCluster controller handles the edge case where multiple head Pods are created. This can happen in some extreme cases, even though we have already implemented expectations. The handling starts at this branch in the controller:
    } else if len(headPods.Items) > 1 {
Having multiple head Pods is a fatal error for a Ray cluster. If more than one Pod sits behind the head service, worker Pods may connect to different head Pods when the underlying connection uses the service name instead of the virtual IP. A worker Pod that connects to a different head Pod may be killed by that head Pod.
Currently, the name of the head Pod follows the format raycluster-kuberay-head-xxxxx (raycluster-kuberay is the name of the RayCluster CR). The Pod name is nondeterministic.
Two action items:
- Make the name of the head Pod deterministic (e.g., raycluster-kuberay-head-xxxxx → raycluster-kuberay-head) so that the K8s API server rejects the creation request if extreme cases occur.
- Remove the following code (kuberay/ray-operator/controllers/ray/raycluster_controller.go, lines 736 to 754 in 0d848f9):
    } else if len(headPods.Items) > 1 {
        logger.Info("reconcilePods: Found more than one head Pods; deleting extra head Pods.", "nHeadPods", len(headPods.Items))
        // TODO (kevin85421): In-place update may not be a good idea.
        itemLength := len(headPods.Items)
        for index := 0; index < itemLength; index++ {
            if headPods.Items[index].Status.Phase == corev1.PodRunning || headPods.Items[index].Status.Phase == corev1.PodPending {
                headPods.Items[index] = headPods.Items[len(headPods.Items)-1] // Replace healthy pod at index i with the last element from the list of pods to delete.
                headPods.Items = headPods.Items[:len(headPods.Items)-1] // Truncate slice.
                itemLength--
            }
        }
        // delete all the extra head pod pods
        for _, extraHeadPodToDelete := range headPods.Items {
            if err := r.Delete(ctx, &extraHeadPodToDelete); err != nil {
                return errstd.Join(utils.ErrFailedDeleteHeadPod, err)
            }
            r.rayClusterScaleExpectation.ExpectScalePod(extraHeadPodToDelete.Namespace, instance.Name, expectations.HeadGroup, extraHeadPodToDelete.Name, expectations.Delete)
        }
    }
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!