Skip to content

Conversation

MengjinYan
Copy link
Collaborator

Why are these changes needed?

This PR updated the resource request & gang resource request data model for communication between GCS and Autoscaler.

This is to enable GCS to send the associated node label requirements to Autoscaler to support the labeled scheduling feature.

The PR added the label selectors to the resource requests and Bundle selectors for the gang resource requests.

Reference: Public label based scheduling API

Related issue number

#51564

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Comment on lines +58 to +60
LABEL_OPERATOR_IN = 1;
// This is to support not equal or not in semantics.
LABEL_OPERATOR_NOT_IN = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so exist or not exist are not supported at the first place.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg. The pr lg. Will let Janet take a look as well

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, we can add in the future

@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label Mar 31, 2025
@MengjinYan
Copy link
Collaborator Author

@edoakes Wondering if you can help to take a look as well? Thanks!

Comment on lines 86 to 90
// The node label requirements for the request. Multiple label selectors are for
// fallback mechanism. When trying to find a node that satisfies the label
// requirements, the first label selector should be tried first, if not found,
// the second label selector should be tried, and so on.
repeated LabelSelector label_selectors = 3;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would this work with coordinated fallback between resources and labels? That is, overriding resource requirements depending on the label selector that's selected.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't necessarily need to make the change to support this immediately if there is a path to support it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea is that if we want to do coordinated fallback between resources and labels, we can do similar things as the what we are doing with the placement group bundles:

  1. deprecate the current resources_bundle field
  2. add a new proto message ResourceSelector with 1 map<string, double> field
  3. add a new field with type repeated ResourceSelector in the current ResourceRequest proto message

In this way, the newly added resource_selectors and the existing label_selectors can work together to support the coordinated fallback mechanism. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me -- we should align the interfaces for PG and regular tasks/actors as much as possible

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should do it now, or can be added when we take on the work to do resource fallback?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do that when we do the resource fallback work.

Comment on lines +101 to +104
message BundleSelector {
// The list of resource requests that should be allocated together.
repeated ResourceRequest resource_requests = 1;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide an example of how this would be used for fallback?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the note about the fallback mechanism for GangResourceRequest is added inside the comment in GangResourceRequest: https://github.com/ray-project/ray/pull/51771/files#diff-79033515bd7d0b4d966220d9c45290447e0aac8afd71b048a817b0d9bdfaa052R114-R117

  // The bundle requests. Multiple bundle selectors are for fallback mechanism.
  // When trying to find nodes that satisfies the bundle selector, the first bundle
  // selector should be tried first, if not found, the second bundle selector should be
  // tried, and so on.

I also just added the more notes in ResourceRequest indicating that in bundle selector case, the fallback will be done on the bundle selector level and in resource request, there should only be 1 label selector.

Wondering if this is something you were looking for?

MengjinYan and others added 2 commits March 31, 2025 11:54
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@cszhu cszhu added the core Issues that should be addressed in Ray Core label Apr 3, 2025
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@edoakes edoakes enabled auto-merge (squash) April 11, 2025 13:18
@github-actions github-actions bot disabled auto-merge April 11, 2025 21:42
@MengjinYan
Copy link
Collaborator Author

@edoakes All tests are passed. Wondering if you can help merge the PR? Thanks!

@edoakes edoakes merged commit b7b7645 into ray-project:master Apr 15, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-backlog core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants