Skip to content

Conversation

ryanaoleary
Copy link
Contributor

Why are these changes needed?

This PR adds node labels to the Ray runtime context. This change will make it easier to verify that a task/actor is scheduled on a node with matching label requirements.

Related issue number

#51564

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ryanaoleary and others added 2 commits May 10, 2025 02:30
@ryanaoleary ryanaoleary changed the title Add node labels to runtime context [Core[ Add node labels to runtime context May 10, 2025
@ryanaoleary ryanaoleary changed the title [Core[ Add node labels to runtime context [Core] Add node labels to runtime context May 10, 2025
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label May 12, 2025
@masoudcharkhabi masoudcharkhabi added core Issues that should be addressed in Ray Core stability labels May 12, 2025
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary
Copy link
Contributor Author

ryanaoleary commented May 13, 2025

I realized in this PR that there was no functionality to return the node labels for a given node from GCS, so in
401c5eb I added labels to the node_info that is returned by get_node here. I initially tried to use the node function _get_node_labels here, but this can't be used from the runtime context because it just returns whatever is in the RayParams (which are passed to the raylet command but don't remain saved here), and not what was initially passed to the raylet and then saved in GCS.

@ryanaoleary ryanaoleary requested review from edoakes and MengjinYan May 13, 2025 08:44
@ryanaoleary
Copy link
Contributor Author

cc: @MengjinYan @edoakes

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Comment on lines 598 to 600
# Return labels for this node from GCS.
node_id = global_worker.core_worker.get_current_node_id()
node_info = services.get_node(self.node.gcs_address, node_id.hex())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any other runtime_context fields that require RPCs to the GCS?

let's at least lazily initialize & then cache this value, so if users call it in a tight loop it won't require an RPC on every invocation (labels are static for now)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 62ff362. I don't see other runtime context fields calling to GCS, stuff like get_current_node_id call to the core worker, but I don't think we can get the node labels from there without calling to GCS/the node table. An alternative to calling services.get_node in the worker would be adding a field self._node_labels in the Node class and setting it here when it's initialized from node_info. I think the issue is that on connect_only calls to the Node, the RayParams are empty and so the value returned is None.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm it seems like this existing codepath relies on the ray params in the connect_only branch:

self._ray_params.node_manager_port = node_info["node_manager_port"]

Getting it from the Node class is more in line with how the rest of the runtime context API is implemented, so if we can do that in a way that's correct I'd prefer it. I'll leave it to you to briefly investigate if this works in all cases, and if not we can go with the current state of the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree in the Node class is better, since I think it'd be possible for a worker process to connect to a different node right (and with the cache implementation we'd only ever return the first node's labels). I didn't see at that point in the connect_only path that it uses the RayParams unless _plasma_store_socket_name is None. I still think it makes more sense to query the node_info here rather than in the worker, since node __init__ already makes these calls in both branches in the above location and here to set vars. In
2392754 I updated the logic to call get_node if connect_only and retrieve the node labels if set.

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from edoakes May 13, 2025 22:19
"market-type": "spot",
},
)
ray.init(address=cluster.address)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also add a condition that checks the labels from runtime context from the driver

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry just saw this comment, added a check in 516cc14 that the node labels from the driver runtime context should be empty

@edoakes
Copy link
Collaborator

edoakes commented May 22, 2025

@ryanaoleary is this ready for re-review?

ryanaoleary and others added 2 commits May 22, 2025 17:52
@ryanaoleary
Copy link
Contributor Author

@ryanaoleary is this ready for re-review?

Yeah this should be good to review again

@ryanaoleary ryanaoleary requested a review from edoakes May 22, 2025 17:54
@edoakes edoakes enabled auto-merge (squash) May 22, 2025 17:58
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label May 22, 2025
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@github-actions github-actions bot disabled auto-merge May 22, 2025 20:30
@edoakes edoakes merged commit 181f04d into ray-project:master May 22, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests stability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants