Skip to content

Conversation

ryanaoleary
Copy link
Contributor

Why are these changes needed?

This PR is a follow-up to this comment: #51901 (comment). This PR changes the cluster resource scheduler to propagate a Ray status to ComputeResources in TaskSpecification when the LabelSelector data type is initialized. This allows a task built with a malformed label selector to return an error as a more useful Python exception rather than crashing Ray components in the C++.

Related issue number

#51564

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary
Copy link
Contributor Author

cc: @dayshah, this is the follow-up to your comment on my previous PR. I have this marked as a draft currently because I'm still figuring out how to propagate the status from the TaskSpecification constructor call to submit_task in ray/python/ray/_raylet.pyx so that it's shown in the Python.

Copy link

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 13, 2025
Copy link

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jun 27, 2025
@MengjinYan
Copy link
Collaborator

@dayshah Looks like it is closed by the GitHub-bot. But wondering is it actually stale or it is still relevant for label scheduling errors propagation?

@dayshah
Copy link
Contributor

dayshah commented Jul 23, 2025

not sure what the current state of the label scheduling code is, but I think the point of this was that the driver process shouldn't crash if the user put in bad labels. If this is still the case this should still be done @ryanaoleary

@ryanaoleary
Copy link
Contributor Author

@dayshah @MengjinYan Yeah that makes sense to me, sorry for the delay. I'll fix this PR and re-open it today.

@MengjinYan MengjinYan reopened this Jul 23, 2025
@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Jul 24, 2025
@ryanaoleary ryanaoleary marked this pull request as ready for review July 25, 2025 09:53
@ryanaoleary ryanaoleary requested a review from a team as a code owner July 25, 2025 09:53
@ryanaoleary
Copy link
Contributor Author

@MengjinYan @dayshah I fixed this PR with f2951c1, added some unit tests, and it's now ready for review.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Jul 25, 2025
Copy link
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the build is failing on some steps bc of -Wunused-result with ComputeResources

@@ -19,26 +19,46 @@

namespace ray {

// Constructor to parse LabelSelector data type from proto.
// Legacy constructor to parse LabelSelector data type from proto.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooc, what needs to be done to move away from the legacy constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was in use in another PR and didn't want to block on this one, but I actually don't see it being used anywhere so I think I can just remove it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on offline discussion changed to still use the constructor but remove the error calls and do all the validation in prepare_label_selector before the label selector map is passed to BuildTaskSpec. Since the LabelSelector is constructed directly inside the TaskSpec when it's created, to propogate an error up to check_status in the cython we'd need change each of the construction calls for tasks, bundles, and placement groups to a call like this:

static StatusOr<PlacementGroupSpecification> Create(
    std::shared_ptr<rpc::PlacementGroupSpec> message) {
  PlacementGroupSpecification spec(std::move(message));
  RAY_RETURN_NOT_OK(spec.ConstructBundles());
  return spec;
}

I think a better structure is to ensure we never get to a place where a TaskSpec can be constructed with an invalid spec, which is ensured through extending prepare_label_selector. This is consistent with how resources are prepared and passed to a Task:

cdef int prepare_resources(
, doing the check here rather than in the c++.

@@ -574,6 +574,9 @@ class TaskSpecification : public MessageWrapper<rpc::TaskSpec> {
static absl::flat_hash_map<SchedulingClass, SchedulingClassDescriptor> sched_id_to_cls_
ABSL_GUARDED_BY(mutex_);
static int next_sched_id_ ABSL_GUARDED_BY(mutex_);

FRIEND_TEST(TaskSpecTest, TestInvalidLabelSelectorPropagatesStatus);
FRIEND_TEST(TaskSpecTest, TestValidLabelSelectorSucceeds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we test using a higher level public api?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already validate the label selector at the Python level:

"label_selector": Option((dict, type(None)), lambda x: validate_label_selector(x)),
, so I think the kinds of malformed labels that would cause the InvalidArgument wouldn't be able to be passed in to the API.

@ryanaoleary ryanaoleary force-pushed the propogate-invalid-argument-status branch from f2951c1 to 05f901d Compare August 1, 2025 22:00
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary force-pushed the propogate-invalid-argument-status branch from 05f901d to 915dfaf Compare August 1, 2025 22:04
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary changed the title [Core] Propogate InvalidArgument Status from LabelSelector Data Type [Core] Remove RAY_LOG(ERROR from LabelSelector and validate in prepare_label_selector Aug 1, 2025
@ryanaoleary ryanaoleary requested review from edoakes and dayshah August 1, 2025 22:15
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Aug 3, 2025
@edoakes edoakes enabled auto-merge (squash) August 3, 2025 18:44
@edoakes
Copy link
Collaborator

edoakes commented Aug 3, 2025

@ryanaoleary please add go label to start CI tests on your PRs when appropriate

@jjyao
Copy link
Collaborator

jjyao commented Aug 4, 2025

[2025-08-03T19:34:26Z] //src/ray/common/test:label_selector_test FAILED in 3 out of 3 in 0.3s

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@github-actions github-actions bot disabled auto-merge August 4, 2025 18:08
@ryanaoleary
Copy link
Contributor Author

[f69d91a](/ray-project/ray/pull/52964/commits/f69d91a63b8b656487b5dcf6edb6b2fdd68380bf)

Should be fixed with f69d91a, I forgot to remove the c++ tests that checked that ray logged an error in the LabelSelector constructor, which this PR removes.

@edoakes edoakes enabled auto-merge (squash) August 4, 2025 18:50
@edoakes edoakes merged commit 8fd232b into ray-project:master Aug 4, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Aug 4, 2025
…pare_label_selector` (#52964)

This PR is a follow-up to this comment:
#51901 (comment).
This PR changes the cluster resource scheduler to propagate a Ray status
to `ComputeResources` in `TaskSpecification` when the LabelSelector data
type is initialized. This allows a task built with a malformed label
selector to return an error as a more useful Python exception rather
than crashing Ray components in the C++.

#51564

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
elliot-barn pushed a commit that referenced this pull request Aug 4, 2025
…pare_label_selector` (#52964)

This PR is a follow-up to this comment:
#51901 (comment).
This PR changes the cluster resource scheduler to propagate a Ray status
to `ComputeResources` in `TaskSpecification` when the LabelSelector data
type is initialized. This allows a task built with a malformed label
selector to return an error as a more useful Python exception rather
than crashing Ray components in the C++.

#51564

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
kamil-kaczmarek pushed a commit that referenced this pull request Aug 4, 2025
…pare_label_selector` (#52964)

This PR is a follow-up to this comment:
#51901 (comment).
This PR changes the cluster resource scheduler to propagate a Ray status
to `ComputeResources` in `TaskSpecification` when the LabelSelector data
type is initialized. This allows a task built with a malformed label
selector to return an error as a more useful Python exception rather
than crashing Ray components in the C++.

#51564

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
mjacar pushed a commit to mjacar/ray that referenced this pull request Aug 5, 2025
…pare_label_selector` (ray-project#52964)

This PR is a follow-up to this comment:
ray-project#51901 (comment).
This PR changes the cluster resource scheduler to propagate a Ray status
to `ComputeResources` in `TaskSpecification` when the LabelSelector data
type is initialized. This allows a task built with a malformed label
selector to return an error as a more useful Python exception rather
than crashing Ray components in the C++.

ray-project#51564

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
elliot-barn pushed a commit that referenced this pull request Aug 5, 2025
…pare_label_selector` (#52964)

This PR is a follow-up to this comment:
#51901 (comment).
This PR changes the cluster resource scheduler to propagate a Ray status
to `ComputeResources` in `TaskSpecification` when the LabelSelector data
type is initialized. This allows a task built with a malformed label
selector to return an error as a more useful Python exception rather
than crashing Ray components in the C++.

#51564

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
sampan-s-nayak pushed a commit that referenced this pull request Aug 12, 2025
…pare_label_selector` (#52964)

This PR is a follow-up to this comment:
#51901 (comment).
This PR changes the cluster resource scheduler to propagate a Ray status
to `ComputeResources` in `TaskSpecification` when the LabelSelector data
type is initialized. This allows a task built with a malformed label
selector to return an error as a more useful Python exception rather
than crashing Ray components in the C++.

#51564

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: sampan <sampan@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants