-
Notifications
You must be signed in to change notification settings - Fork 6.7k
[Core] Remove RAY_LOG(ERROR
from LabelSelector and validate in prepare_label_selector
#52964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Remove RAY_LOG(ERROR
from LabelSelector and validate in prepare_label_selector
#52964
Conversation
cc: @dayshah, this is the follow-up to your comment on my previous PR. I have this marked as a draft currently because I'm still figuring out how to propagate the status from the |
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
This pull request has been automatically closed because there has been no more activity in the 14 days Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for your contribution! |
@dayshah Looks like it is closed by the GitHub-bot. But wondering is it actually stale or it is still relevant for label scheduling errors propagation? |
not sure what the current state of the label scheduling code is, but I think the point of this was that the driver process shouldn't crash if the user put in bad labels. If this is still the case this should still be done @ryanaoleary |
@dayshah @MengjinYan Yeah that makes sense to me, sorry for the delay. I'll fix this PR and re-open it today. |
@MengjinYan @dayshah I fixed this PR with f2951c1, added some unit tests, and it's now ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the build is failing on some steps bc of -Wunused-result with ComputeResources
@@ -19,26 +19,46 @@ | |||
|
|||
namespace ray { | |||
|
|||
// Constructor to parse LabelSelector data type from proto. | |||
// Legacy constructor to parse LabelSelector data type from proto. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooc, what needs to be done to move away from the legacy constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was in use in another PR and didn't want to block on this one, but I actually don't see it being used anywhere so I think I can just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on offline discussion changed to still use the constructor but remove the error calls and do all the validation in prepare_label_selector
before the label selector map is passed to BuildTaskSpec
. Since the LabelSelector
is constructed directly inside the TaskSpec
when it's created, to propogate an error up to check_status
in the cython we'd need change each of the construction calls for tasks, bundles, and placement groups to a call like this:
static StatusOr<PlacementGroupSpecification> Create(
std::shared_ptr<rpc::PlacementGroupSpec> message) {
PlacementGroupSpecification spec(std::move(message));
RAY_RETURN_NOT_OK(spec.ConstructBundles());
return spec;
}
I think a better structure is to ensure we never get to a place where a TaskSpec can be constructed with an invalid spec, which is ensured through extending prepare_label_selector. This is consistent with how resources are prepared and passed to a Task:
Line 799 in 490994b
cdef int prepare_resources( |
src/ray/common/task/task_spec.h
Outdated
@@ -574,6 +574,9 @@ class TaskSpecification : public MessageWrapper<rpc::TaskSpec> { | |||
static absl::flat_hash_map<SchedulingClass, SchedulingClassDescriptor> sched_id_to_cls_ | |||
ABSL_GUARDED_BY(mutex_); | |||
static int next_sched_id_ ABSL_GUARDED_BY(mutex_); | |||
|
|||
FRIEND_TEST(TaskSpecTest, TestInvalidLabelSelectorPropagatesStatus); | |||
FRIEND_TEST(TaskSpecTest, TestValidLabelSelectorSucceeds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we test using a higher level public api?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already validate the label selector at the Python level:
ray/python/ray/_common/ray_option_utils.py
Line 129 in df375df
"label_selector": Option((dict, type(None)), lambda x: validate_label_selector(x)), |
InvalidArgument
wouldn't be able to be passed in to the API.
f2951c1
to
05f901d
Compare
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
05f901d
to
915dfaf
Compare
RAY_LOG(ERROR
from LabelSelector and validate in prepare_label_selector
@ryanaoleary please add |
|
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Should be fixed with f69d91a, I forgot to remove the c++ tests that checked that ray logged an error in the LabelSelector constructor, which this PR removes. |
…pare_label_selector` (#52964) This PR is a follow-up to this comment: #51901 (comment). This PR changes the cluster resource scheduler to propagate a Ray status to `ComputeResources` in `TaskSpecification` when the LabelSelector data type is initialized. This allows a task built with a malformed label selector to return an error as a more useful Python exception rather than crashing Ray components in the C++. #51564 --------- Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
…pare_label_selector` (#52964) This PR is a follow-up to this comment: #51901 (comment). This PR changes the cluster resource scheduler to propagate a Ray status to `ComputeResources` in `TaskSpecification` when the LabelSelector data type is initialized. This allows a task built with a malformed label selector to return an error as a more useful Python exception rather than crashing Ray components in the C++. #51564 --------- Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
…pare_label_selector` (#52964) This PR is a follow-up to this comment: #51901 (comment). This PR changes the cluster resource scheduler to propagate a Ray status to `ComputeResources` in `TaskSpecification` when the LabelSelector data type is initialized. This allows a task built with a malformed label selector to return an error as a more useful Python exception rather than crashing Ray components in the C++. #51564 --------- Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
…pare_label_selector` (ray-project#52964) This PR is a follow-up to this comment: ray-project#51901 (comment). This PR changes the cluster resource scheduler to propagate a Ray status to `ComputeResources` in `TaskSpecification` when the LabelSelector data type is initialized. This allows a task built with a malformed label selector to return an error as a more useful Python exception rather than crashing Ray components in the C++. ray-project#51564 --------- Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: Michael Acar <michael.j.acar@gmail.com>
…pare_label_selector` (#52964) This PR is a follow-up to this comment: #51901 (comment). This PR changes the cluster resource scheduler to propagate a Ray status to `ComputeResources` in `TaskSpecification` when the LabelSelector data type is initialized. This allows a task built with a malformed label selector to return an error as a more useful Python exception rather than crashing Ray components in the C++. #51564 --------- Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…pare_label_selector` (#52964) This PR is a follow-up to this comment: #51901 (comment). This PR changes the cluster resource scheduler to propagate a Ray status to `ComputeResources` in `TaskSpecification` when the LabelSelector data type is initialized. This allows a task built with a malformed label selector to return an error as a more useful Python exception rather than crashing Ray components in the C++. #51564 --------- Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: sampan <sampan@anyscale.com>
Why are these changes needed?
This PR is a follow-up to this comment: #51901 (comment). This PR changes the cluster resource scheduler to propagate a Ray status to
ComputeResources
inTaskSpecification
when the LabelSelector data type is initialized. This allows a task built with a malformed label selector to return an error as a more useful Python exception rather than crashing Ray components in the C++.Related issue number
#51564
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.