Skip to content

[Refactor] Make head Pod name deterministic #3013

@kevin85421

Description

@kevin85421

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

The RayCluster controller handles edge cases where multiple head Pods are created. This is possible in some extreme cases, even though we have already implemented expectations.

} else if len(headPods.Items) > 1 {

Having multiple head Pods is a fatal error for a Ray cluster. If there is more than one Pod behind the head service, worker Pods may connect to different head Pods if the underlying connection uses the service name instead of the virtual IP. If a worker Pod connects to different head Pod, it may be killed by the head Pod.

Currently, the name of the head Pod follows the format raycluster-kuberay-head-xxxxx (raycluster-kuberay is the name of the RayCluster CR). The Pod name is undeterministic.

Two action items:

  1. Make the name of the head Pod deterministic (e.g., raycluster-kuberay-head-xxxxxraycluster-kuberay-head) so that the K8s API server rejects the creation request if extreme cases occur.

  2. Remove

    } else if len(headPods.Items) > 1 {
    logger.Info("reconcilePods: Found more than one head Pods; deleting extra head Pods.", "nHeadPods", len(headPods.Items))
    // TODO (kevin85421): In-place update may not be a good idea.
    itemLength := len(headPods.Items)
    for index := 0; index < itemLength; index++ {
    if headPods.Items[index].Status.Phase == corev1.PodRunning || headPods.Items[index].Status.Phase == corev1.PodPending {
    headPods.Items[index] = headPods.Items[len(headPods.Items)-1] // Replace healthy pod at index i with the last element from the list of pods to delete.
    headPods.Items = headPods.Items[:len(headPods.Items)-1] // Truncate slice.
    itemLength--
    }
    }
    // delete all the extra head pod pods
    for _, extraHeadPodToDelete := range headPods.Items {
    if err := r.Delete(ctx, &extraHeadPodToDelete); err != nil {
    return errstd.Join(utils.ErrFailedDeleteHeadPod, err)
    }
    r.rayClusterScaleExpectation.ExpectScalePod(extraHeadPodToDelete.Namespace, instance.Name, expectations.HeadGroup, extraHeadPodToDelete.Name, expectations.Delete)
    }
    }

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions