-
Notifications
You must be signed in to change notification settings - Fork 6.7k
[Core] Update V2 Autoscaler to support scheduling using Node labels and LabelSelector API #53578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…pute_score Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@MengjinYan I ended up implementing the logic to pass labels from GCS to the autoscaler and consider labels in the autoscaler when scheduling in the same PR because it made it easier to test the change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some tests are failing, could be related @ryanaoleary
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we should also add the logic to parse the LabelSelector from the GCS GetClusterStatusReply
in autoscaler to populate the label selector information.
Also, is it possible to add end to end test with label selector?
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@rueian Can you take a look at the PR as well from autoscaler's perspective to see if we missed anything? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we should also change the _sort_resource_request
logic in _try_schedule
to add labels to the sorting mechanism, as there can resource requests with exact same resource requirements but different label selectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, done in 8cb1816. I had it sort by the length of the constraints of the first selector, similar to how we sort based on the length of the placement constraints.
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@@ -493,6 +493,8 @@ message ResourceDemand { | |||
// The number of requests of this shape still queued in CoreWorkers that this | |||
// raylet knows about. | |||
int64 backlog_size = 4; | |||
// The label selector constraints for this Resource shape on a node. | |||
repeated LabelSelector label_selectors = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a repeated LabelSelectorConstraint
field in the LabelSelector
. Why do we need repeated
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LabelSelectorConstraint
s are the conditions associated with one label selector with AND semantics (i.e. we require a node to have labels accelerator-type="TPU" and market-type="SPOT"
if those two respective constraints are specified). The repeated label selector is so that we can support fallback with label selectors (i.e. the user specifies two separate label selectors with their own constraint(s), and we fallback to the second one if the first is unsatisfiable).
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally just minor nits, not really blockers
src/ray/gcs/gcs_server/test/gcs_autoscaler_state_manager_test.cc
Outdated
Show resolved
Hide resolved
|
||
namespace std { | ||
template <> | ||
struct hash<ray::LabelSelector> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we using std sets/maps with this somewhere instead of absl ones. Would prefer sticking to absl and just defining the absl hash functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using std::hash
in task_spec.h here for the scheduling class descriptor. I think since we're using absl::flat_hash_map
in that class I can just go ahead and switch it all to use absl hash functions instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 9b3bbbe.
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
…nd LabelSelector API (ray-project#53578) Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Krishna Kalyan <krishnakalyan3@gmail.com>
…nd LabelSelector API (ray-project#53578) Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: avigyabb <avigyabb@stanford.edu>
…nd LabelSelector API (ray-project#53578) Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: avigyabb <avigyabb@stanford.edu>
…nd LabelSelector API (#53578) Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Why are these changes needed?
This PR updates the GCS state manager to append the labels set on a Node and retrieved from GCS to the
dynamic_node_labels
on the NodeState passed to the autoscaler. This PR also updates the V2 autoscaler to consider labels when calling_compute_score
to schedule aResourceRequest
on a Node, assigning a higher score to a node type with labels that satisfy a LabelSelector.Related issue number
#51564
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.