[Core][Autoscaler] Configure idleTimeoutSeconds per node type #48813
Conversation
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
TODO: @ryanaoleary I'll update this PR with doc/API changes and comments containing my manual testing process.

Manual testing process (KubeRay):
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org> Signed-off-by: ryanaoleary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Before merging this PR, would you mind:
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Autoscaler logs show available_node_types:
- worker group with `idleTimeoutSeconds` set
- worker group without `idleTimeoutSeconds` set

There was a CI error for a Ray Serve test, but I think it's unrelated to this PR.
cc @rickyyx this PR looks good to me. Would you mind taking a look? Thanks!
LGTM with a nit.
@@ -128,6 +128,8 @@ class NodeTypeConfig:
    min_worker_nodes: int
    # The maximal number of worker nodes that can be launched for this node type.
    max_worker_nodes: int
    # Idle timeout seconds for worker nodes of this node type.
    idle_timeout_s: Optional[float] = None
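To illustrate how a per-node-type timeout interacts with the cluster-wide default, here is a minimal sketch. This is not the PR's actual termination code; `effective_idle_timeout` and `should_terminate` are hypothetical helpers, and the dataclass is a simplified stand-in for the autoscaler's config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeTypeConfig:
    """Simplified stand-in for the autoscaler's node type config."""
    name: str
    min_worker_nodes: int
    max_worker_nodes: int
    # Idle timeout seconds for worker nodes of this node type.
    idle_timeout_s: Optional[float] = None


def effective_idle_timeout(cfg: NodeTypeConfig, global_idle_timeout_s: float) -> float:
    # The per-node-type value, when set, overrides the cluster-wide default.
    return cfg.idle_timeout_s if cfg.idle_timeout_s is not None else global_idle_timeout_s


def should_terminate(idle_duration_s: float, cfg: NodeTypeConfig,
                     global_idle_timeout_s: float) -> bool:
    # A worker becomes an idle-termination candidate once it has been idle
    # at least as long as its effective timeout.
    return idle_duration_s >= effective_idle_timeout(cfg, global_idle_timeout_s)
```

With this shape, worker groups that leave the field unset keep today's behavior, while groups that set it get their own idle window.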
nit: should we enforce it as an integer with a cast when we add this? I see it being an int as part of the schema.
Or we could make this a float in the schema too. No preference either way.
I'll change it to a `number` type in the schema and then add a cast to float where we call `idle_timeout_s = group_spec.get(IDLE_SECONDS_KEY)`, since I implemented it as an int in the RayCluster CRD for consistency with the other field: https://github.com/ray-project/kuberay/blob/925effe34022c72c41691c0b79d8d3051d4a1b77/ray-operator/apis/ray/v1/raycluster_types.go#L94
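A minimal sketch of the cast discussed above (the parsing helper is hypothetical; `IDLE_SECONDS_KEY` is assumed to name the worker group spec field from this thread):

```python
from typing import Any, Dict, Optional

# Assumed key name for the per-worker-group field discussed above.
IDLE_SECONDS_KEY = "idleTimeoutSeconds"


def parse_idle_timeout(group_spec: Dict[str, Any]) -> Optional[float]:
    # The RayCluster CRD declares the field as an integer, but the
    # autoscaler stores it as Optional[float], so cast on read and
    # preserve "unset" as None.
    raw = group_spec.get(IDLE_SECONDS_KEY)
    return float(raw) if raw is not None else None
```

Casting at the parse boundary keeps the rest of the autoscaler code free to compare the timeout against float idle durations.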
I ran the tests again and implemented this change in: 1bd8afb
Awesome, thanks for the great work!
@@ -1434,6 +1434,82 @@ def test_idle_termination_with_min_worker(min_workers):
    assert len(to_terminate) == 0


@pytest.mark.parametrize("node_type_idle_timeout_s", [1, 2, 10])
def test_idle_termination_with_node_type_idle_timeout(node_type_idle_timeout_s):
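As a hedged illustration of what a parametrized test like this exercises, here is a self-contained sketch. `is_idle_terminated` and `DEFAULT_IDLE_TIMEOUT_S` are stand-ins for the autoscaler's real check and default, not the PR's actual test body.

```python
from typing import Optional

import pytest

DEFAULT_IDLE_TIMEOUT_S = 60.0  # assumed cluster-wide default for this sketch


def is_idle_terminated(idle_s: float,
                       node_type_idle_timeout_s: Optional[float]) -> bool:
    # Stand-in for the autoscaler's check: the per-type timeout wins when set.
    timeout = (node_type_idle_timeout_s
               if node_type_idle_timeout_s is not None
               else DEFAULT_IDLE_TIMEOUT_S)
    return idle_s >= timeout


@pytest.mark.parametrize("node_type_idle_timeout_s", [1, 2, 10])
def test_idle_termination_with_node_type_idle_timeout(node_type_idle_timeout_s):
    # Idle longer than the per-type timeout: candidate for termination.
    assert is_idle_terminated(node_type_idle_timeout_s + 0.5,
                              node_type_idle_timeout_s)
    # Idle shorter than the per-type timeout: kept alive, even though the
    # duration is far below the cluster-wide default.
    assert not is_idle_terminated(node_type_idle_timeout_s - 0.5,
                                  node_type_idle_timeout_s)
```

The parametrization covers timeouts both above and far below the default, which is the behavior this PR enables.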
Nice!
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
…oject#48813)

## Why are these changes needed?

Adds `idle_timeout_s` as a field to `node_type_configs`, enabling the v2 autoscaler to configure idle termination per worker type.

This PR depends on a change in KubeRay to the RayCluster CRD, since we want to support passing `idleTimeoutSeconds` to individual worker groups such that they can specify a custom idle duration: ray-project/kuberay#2558

## Related issue number

Closes ray-project#36888

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Ricky Xu <xuchen727@hotmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>