[Core] Add node labels to runtime context #52926
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
I realized in this PR that there was no functionality to return the node labels for a given node from GCS.

cc: @MengjinYan @edoakes
python/ray/_private/worker.py (outdated):

    # Return labels for this node from GCS.
    node_id = global_worker.core_worker.get_current_node_id()
    node_info = services.get_node(self.node.gcs_address, node_id.hex())
are there any other runtime_context fields that require RPCs to the GCS?
let's at least lazily initialize & then cache this value, so if users call it in a tight loop it won't require an RPC on every invocation (labels are static for now)
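The suggested lazy-init-and-cache pattern can be sketched as follows. The names here (`RuntimeContext`, `get_node_labels`, `_fetch_labels_from_gcs`) are illustrative stand-ins rather than Ray's actual API, and the GCS RPC is simulated with a call counter:

```python
class RuntimeContext:
    """Minimal stand-in for Ray's runtime context (sketch only)."""

    def __init__(self):
        self._node_labels = None  # lazily populated on first access
        self.gcs_calls = 0        # instrumentation for this sketch

    def _fetch_labels_from_gcs(self):
        # Stand-in for the real GCS RPC (services.get_node in the PR).
        self.gcs_calls += 1
        return {"market-type": "spot"}

    def get_node_labels(self):
        # Labels are static for now, so one RPC per process suffices;
        # repeated calls in a tight loop hit the cache instead.
        if self._node_labels is None:
            self._node_labels = self._fetch_labels_from_gcs()
        return self._node_labels


ctx = RuntimeContext()
first = ctx.get_node_labels()   # triggers the (simulated) RPC
second = ctx.get_node_labels()  # served from the cache
```

Since labels cannot change during a node's lifetime in this PR, the cache never needs invalidation.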
+1
Done in 62ff362. I don't see other runtime context fields calling to GCS; fields like `get_current_node_id` call to the core worker, but I don't think we can get the node labels from there without calling to GCS/the node table. An alternative to calling `services.get_node` in the worker would be adding a field `self._node_labels` in the Node class and setting it there when it's initialized from `node_info`. I think the issue is that on `connect_only` calls to the Node, the RayParams are empty and so the value returned is None.
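The alternative described above can be sketched like so; the `node_info` dict shape and the `labels` key are assumptions for illustration:

```python
class Node:
    """Sketch: Node caches labels from node_info at init time."""

    def __init__(self, node_info):
        # node_info mimics a GCS node table entry; labels may be
        # unset for a node, hence the empty-dict fallback.
        self._node_labels = node_info.get("labels") or {}

    @property
    def node_labels(self):
        return self._node_labels


labeled = Node({"node_manager_port": 62665,
                "labels": {"market-type": "spot"}})
bare = Node({"node_manager_port": 62665})  # no labels set
```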
Hm, it seems like this existing codepath relies on the ray params in the `connect_only` branch:

ray/python/ray/_private/node.py, line 293 in 7b6a1af:

    self._ray_params.node_manager_port = node_info["node_manager_port"]

Getting it from the `Node` class is more in line with how the rest of the runtime context API is implemented, so if we can do that in a way that's correct I'd prefer it. I'll leave it to you to briefly investigate if this works in all cases, and if not we can go with the current state of the PR.
Yeah, I agree the Node class is better, since I think it'd be possible for a worker process to connect to a different node (and with the cache implementation we'd only ever return the first node's labels). I didn't see that the `connect_only` path uses the RayParams at that point unless `_plasma_store_socket_name` is None. I still think it makes more sense to query the `node_info` here rather than in the worker, since node `__init__` already makes these calls in both branches in the above location to set vars. In 2392754 I updated the logic to call `get_node` if `connect_only` and retrieve the node labels if set.
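The described change can be sketched as below. `get_node` is stubbed out here; the real lookup lives in `ray._private.services`, and the dict shapes are assumptions:

```python
def get_node(gcs_address, node_id_hex):
    # Stub for the GCS node table lookup used by the real code.
    return {"node_manager_port": 62665,
            "labels": {"market-type": "spot"}}


class Node:
    """Sketch: on connect_only, query node_info and pick up labels."""

    def __init__(self, gcs_address, node_id_hex, connect_only=False):
        self._node_labels = {}
        if connect_only:
            node_info = get_node(gcs_address, node_id_hex)
            # Labels may be absent in the node table; default to empty.
            self._node_labels = node_info.get("labels") or {}


attached = Node("127.0.0.1:6379", "ab" * 28, connect_only=True)
fresh = Node("127.0.0.1:6379", "ab" * 28)  # startup path, no GCS query
```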
            "market-type": "spot",
        },
    )
    ray.init(address=cluster.address)
let's also add a condition that checks the labels from runtime context from the driver
Sorry, just saw this comment. Added a check in 516cc14 that the node labels from the driver runtime context should be empty.
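The added driver-side condition can be modeled with a small stand-in; the registry and helper below are hypothetical (the real test goes through `ray.get_runtime_context()` on a multi-node test cluster):

```python
# Nodes as the test sets them up: the driver sits on a head node
# started without labels, while a worker node carries labels.
node_registry = {
    "head": {},                           # driver's node: no labels
    "worker-1": {"market-type": "spot"},  # worker started with labels
}


def get_node_labels(node_id):
    # Stand-in for the runtime context label lookup on that node.
    return node_registry[node_id]


# Driver-side condition: the driver's node labels should be empty.
driver_labels = get_node_labels("head")
worker_labels = get_node_labels("worker-1")
```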
@ryanaoleary is this ready for re-review?
Yeah, this should be good to review again.
Why are these changes needed?
This PR adds node labels to the Ray runtime context. This change will make it easier to verify that a task/actor is scheduled on a node with matching label requirements.
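For instance, the verification this enables boils down to comparing a task's label requirements against the labels its node reports via the runtime context; `labels_match` below is a hypothetical helper for illustration, not Ray API:

```python
def labels_match(required, node_labels):
    """True if every required key is present with the same value."""
    return all(node_labels.get(k) == v for k, v in required.items())


# Labels a node might report from its runtime context.
node_labels = {"market-type": "spot", "region": "us-west1"}

ok = labels_match({"market-type": "spot"}, node_labels)
bad = labels_match({"market-type": "on-demand"}, node_labels)
```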
Related issue number
#51564
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.