Skip to content

Conversation

ryanaoleary
Copy link
Contributor

Why are these changes needed?

This PR adds PublishNotReadyAddresses: true to the headless service created by KubeRay when a RayCluster requests multi-host TPU nodes. When creating a RayService for multi-host TPU inference, TPU initialization currently times out because 1 or more of the workers might be unreachable until a proxy actor is running on that node. This PR unblocks multi-host inference on TPUs with a RayService, since TPU initialization requires worker-to-worker communication even if all proxy actors haven't started yet.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests - edited a running headless-worker-svc with PublishNotReadyAddresses: true and verified RayService deployment succeeded
    • This PR is not tested :(

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary
Copy link
Contributor Author

cc: @kevin85421 @andrewsykim

@@ -307,9 +307,10 @@ func BuildHeadlessServiceForRayCluster(rayCluster rayv1.RayCluster) (*corev1.Ser
Labels: labels,
},
Spec: corev1.ServiceSpec{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For another PR, add unit tests for BuildHeadlessServiceForRayCluster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and added a test case in this PR (c92042b) since it's a fairly small change.

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Copy link
Member

@andrewsykim andrewsykim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


actualSelector := svc.Spec.Selector[utils.RayClusterLabelKey]
expectedSelector := instanceForSvc.Name
if !reflect.DeepEqual(expectedSelector, actualSelector) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be easier to just construct the expected Service object and do a reflect.DeepEqual check against that and the returned object from BuildHeadlessServiceForRayCluster.

Maybe for a separate PR because this is just following the pattern of the other tests.

@andrewsykim
Copy link
Member

@kevin85421 PTAL

@kevin85421 kevin85421 merged commit b8f6d06 into ray-project:master Sep 12, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants