-
Notifications
You must be signed in to change notification settings - Fork 6.7k
[Core][Autoscaler] Update the Autoscaler Resource Requests Data Model for Scheduling for Label Selector #51771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
LABEL_OPERATOR_IN = 1; | ||
// This is to support not equal or not in semantics. | ||
LABEL_OPERATOR_NOT_IN = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so exist or not exist are not supported at the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes from my understanding here: https://docs.google.com/document/d/1DKhPuZERlLbsU4TIQZZ6POCsm1pVMBgN_yn5r0OmaDI/edit?tab=t.0#heading=h.428qclpgsw2d
cc: @edoakes to confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sg. The pr lg. Will let Janet take a look as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct, we can add in the future
@edoakes Wondering if you can help to take a look as well? Thanks! |
src/ray/protobuf/autoscaler.proto
Outdated
// The node label requirements for the request. Multiple label selectors are for | ||
// fallback mechanism. When trying to find a node that satisfies the label | ||
// requirements, the first label selector should be tried first, if not found, | ||
// the second label selector should be tried, and so on. | ||
repeated LabelSelector label_selectors = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would this work with coordinated fallback between resources and labels? That is, overriding resource requirements depending on the label selector that's selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't necessarily need to make the change to support this immediately if there is a path to support it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea is that if we want to do coordinated fallback between resources and labels, we can do similar things as the what we are doing with the placement group bundles:
- deprecate the current
resources_bundle
field - add a new proto message
ResourceSelector
with 1map<string, double>
field - add a new field with type
repeated ResourceSelector
in the currentResourceRequest
proto message
In this way, the newly added resource_selectors and the existing label_selectors can work together to support the coordinated fallback mechanism. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me -- we should align the interfaces for PG and regular tasks/actors as much as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should do it now, or can be added when we take on the work to do resource fallback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do that when we do the resource fallback work.
message BundleSelector { | ||
// The list of resource requests that should be allocated together. | ||
repeated ResourceRequest resource_requests = 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide an example of how this would be used for fallback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the note about the fallback mechanism for GangResourceRequest
is added inside the comment in GangResourceRequest
: https://github.com/ray-project/ray/pull/51771/files#diff-79033515bd7d0b4d966220d9c45290447e0aac8afd71b048a817b0d9bdfaa052R114-R117
// The bundle requests. Multiple bundle selectors are for fallback mechanism.
// When trying to find nodes that satisfies the bundle selector, the first bundle
// selector should be tried first, if not found, the second bundle selector should be
// tried, and so on.
I also just added the more notes in ResourceRequest
indicating that in bundle selector case, the fallback will be done on the bundle selector level and in resource request, there should only be 1 label selector.
Wondering if this is something you were looking for?
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@edoakes All tests are passed. Wondering if you can help merge the PR? Thanks! |
Why are these changes needed?
This PR updated the resource request & gang resource request data model for communication between GCS and Autoscaler.
This is to enable GCS to send the associated node label requirements to Autoscaler to support the labeled scheduling feature.
The PR added the label selectors to the resource requests and Bundle selectors for the gang resource requests.
Reference: Public label based scheduling API
Related issue number
#51564
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.