Skip to content

Conversation

ryanaoleary
Copy link
Contributor

Why are these changes needed?

This PR updates the GCS state manager to append the labels set on a Node and retrieved from GCS to the dynamic_node_labels on the NodeState passed to the autoscaler. This PR also updates the V2 autoscaler to consider labels when calling _compute_score to schedule a ResourceRequest on a Node, assigning a higher score to a node type with labels that satisfy a LabelSelector.

Related issue number

#51564

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…pute_score

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from a team as a code owner June 5, 2025 10:38
@ryanaoleary
Copy link
Contributor Author

@MengjinYan I ended up implementing the logic to pass labels from GCS to the autoscaler and consider labels in the autoscaler when scheduling in the same PR because it made it easier to test the change.

Copy link
Collaborator

@can-anyscale can-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some tests are failing, could be related @ryanaoleary

ryanaoleary and others added 2 commits June 13, 2025 12:46
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Copy link
Collaborator

@MengjinYan MengjinYan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we should also add the logic to parse the LabelSelector from the GCS GetClusterStatusReply in autoscaler to populate the label selector information.

Also, is it possible to add end to end test with label selector?

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary and others added 5 commits June 25, 2025 08:27
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@MengjinYan
Copy link
Collaborator

@rueian Can you take a look at the PR as well from autoscaler's perspective to see if we missed anything? Thanks!

@MengjinYan MengjinYan requested a review from rueian July 2, 2025 21:47
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we should also change the _sort_resource_request logic in _try_schedule to add labels to the sorting mechanism, as there can resource requests with exact same resource requirements but different label selectors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, done in 8cb1816. I had it sort by the length of the constraints of the first selector, similar to how we sort based on the length of the placement constraints.

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@@ -493,6 +493,8 @@ message ResourceDemand {
// The number of requests of this shape still queued in CoreWorkers that this
// raylet knows about.
int64 backlog_size = 4;
// The label selector constraints for this Resource shape on a node.
repeated LabelSelector label_selectors = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a repeated LabelSelectorConstraint field in the LabelSelector. Why do we need repeated here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LabelSelectorConstraints are the conditions associated with one label selector with AND semantics (i.e. we require a node to have labels accelerator-type="TPU" and market-type="SPOT" if those two respective constraints are specified). The repeated label selector is so that we can support fallback with label selectors (i.e. the user specifies two separate label selectors with their own constraint(s), and we fallback to the second one if the first is unsatisfiable).

ryanaoleary and others added 4 commits July 9, 2025 03:28
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Jul 23, 2025
@ray-gardener ray-gardener bot added community-contribution Contributed by the community core Issues that should be addressed in Ray Core labels Jul 23, 2025
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Copy link
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally just minor nits, not really blockers


namespace std {
template <>
struct hash<ray::LabelSelector> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we using std sets/maps with this somewhere instead of absl ones. Would prefer sticking to absl and just defining the absl hash functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using std::hash in task_spec.h here for the scheduling class descriptor. I think since we're using absl::flat_hash_map in that class I can just go ahead and switch it all to use absl hash functions instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 9b3bbbe.

ryanaoleary and others added 10 commits July 28, 2025 21:44
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from dayshah July 29, 2025 09:21
ryanaoleary and others added 4 commits July 29, 2025 09:56
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@jjyao jjyao changed the title Update V2 Autoscaler to support scheduling using Node labels and LabelSelector API [Core] Update V2 Autoscaler to support scheduling using Node labels and LabelSelector API Jul 30, 2025
@jjyao jjyao merged commit ffffcd4 into ray-project:master Jul 30, 2025
5 checks passed
krishnakalyan3 pushed a commit to krishnakalyan3/ray that referenced this pull request Jul 30, 2025
…nd LabelSelector API (ray-project#53578)

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Krishna Kalyan <krishnakalyan3@gmail.com>
avibasnet31 pushed a commit to avibasnet31/ray that referenced this pull request Aug 2, 2025
…nd LabelSelector API (ray-project#53578)

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: avigyabb <avigyabb@stanford.edu>
avibasnet31 pushed a commit to avibasnet31/ray that referenced this pull request Aug 2, 2025
…nd LabelSelector API (ray-project#53578)

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: avigyabb <avigyabb@stanford.edu>
kamil-kaczmarek pushed a commit that referenced this pull request Aug 4, 2025
…nd LabelSelector API (#53578)

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants