[release] Update Ray image to 2.34.0 #2303
Conversation
Signed-off-by: kaihsun <kaihsun@anyscale.com>
@@ -85,7 +85,7 @@ spec:
       spec:
         containers:
         - name: ray-worker
-          image: rayproject/ray-ml:2.4.0
+          image: rayproject/ray-ml:2.9.0
Should we update this to use 2.34.0.fc8721 or 2.34.0.deprecated?
Signed-off-by: kaihsun <kaihsun@anyscale.com>
@@ -312,13 +312,13 @@ jobs:
         run: go test ./pkg/... -race -parallel 4
         working-directory: ${{env.working-directory}}

   test-compatibility-2_7_0:
@@ -38,7 +38,7 @@ applications:
       args:
         num_forwards: 0
       runtime_env:
-        working_dir: https://github.com/ray-project/serve_workloads/archive/a2e2405f3117f1b4134b6924b5f44c4ff0710c00.zip
+        working_dir: https://github.com/ray-project/serve_workloads/archive/a9f184f4d9ddb7f9a578502ae106470f87a702ef.zip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ray Serve introduced some breaking changes after Ray 2.9.0, and I opened ray-project/serve_workloads#4 to fix the issue.
       autoscaling_config:
         metrics_interval_s: 0.2
         min_replicas: 1
         max_replicas: 14
         look_back_period_s: 2
         downscale_delay_s: 5
         upscale_delay_s: 2
-        target_num_ongoing_requests_per_replica: 1
+        target_ongoing_requests: 1
       graceful_shutdown_timeout_s: 5
       max_concurrent_queries: 1000
@@ -285,6 +285,10 @@ def test_zero_downtime_rollout(self, set_up_cluster):
         for cr_event in cr_events:
             cr_event.trigger()

+    @pytest.mark.skip(
Ray Serve introduces some breaking changes (see ray-service.autoscaler.yaml and rayservice_ha_test.go), and the autoscaling test logic no longer works.
(I haven't read the code; this is just my observation and guess, based on the very limited information I can get from the proxy actor's logs.)
Based on that observation and guess, Ray Serve now queues pending requests instead of failing them. The pending requests in the queue delay the scheduling of the "signal" request. When I set max_ongoing_requests to a small number (e.g. 1), the "signal" request can be scheduled correctly and unlocks some "block" requests. However, some pending requests are then dispatched to the replicas immediately, so the replicas can't be scaled down.
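The guess above can be sketched as a toy model. Everything here is illustrative (the function, step counts, and numbers are made up, and this is not Ray's real scheduler); it only shows why a backlog of queued requests prevents scale-down:

```python
# Toy model: each step, every replica completes one request if any are
# pending; a replica is only "idle" (eligible for scale-down) once the
# queue has drained, so queued work keeps replicas busy.
def idle_steps(pending: int, replicas: int, steps: int) -> int:
    """Count the steps during which all work is done and replicas sit idle."""
    idle = 0
    for _ in range(steps):
        if pending == 0:
            idle += 1
        else:
            pending = max(0, pending - replicas)
    return idle

# A backlog of 15 queued requests keeps 1 replica busy for 15 of 20 steps:
assert idle_steps(15, replicas=1, steps=20) == 5
# With no backlog the replica is idle the whole time and can scale down:
assert idle_steps(0, replicas=1, steps=20) == 20
```

Under this model, any nonzero queue shrinks the idle window the autoscaler sees, which matches the observed failure to scale down.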
Signed-off-by: kaihsun <kaihsun@anyscale.com>
       ray_actor_options:
         num_cpus: 0.5
   - name: signal
     import_path: autoscaling.signaling:app
     route_prefix: /signal
     deployments:
     - name: SignalDeployment
       max_ongoing_requests: 1000
Each "block" request will also send a request to the "signal" replica, so max_ongoing_requests should be more than 20 (the number of block requests). The default value changed from 100 to 5 in Ray 2.32.0.
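The arithmetic behind that choice can be checked with a back-of-envelope sketch (the helper function is hypothetical, not a Ray API; the numbers come from this thread):

```python
# Simplified admission model: a deployment dispatches at most
# max_ongoing_requests per replica immediately; the rest wait in a queue.
def admitted(num_requests: int, max_ongoing_requests: int, replicas: int = 1) -> int:
    """Requests dispatched to replicas right away; the remainder queue."""
    return min(num_requests, max_ongoing_requests * replicas)

block_requests = 20  # each "block" request also hits the "signal" replica

assert admitted(block_requests, 5) == 5      # Ray 2.32.0 default: 15 queue up
assert admitted(block_requests, 100) == 20   # pre-2.32.0 default: all admitted
assert admitted(block_requests, 1000) == 20  # value used in this PR: all admitted
```

With the new default of 5, most of the 20 signal-bound requests would queue, which is why the config pins max_ongoing_requests explicitly.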
@@ -31,7 +31,7 @@ spec:
       serviceAccountName: pytorch-distributed-training
       containers:
       - name: ray-head
-        image: rayproject/ray:2.9.0
+        image: rayproject/ray:2.34.0
This example no longer works because the head node is on 2.34 while the worker node is still on 2.9 (using the ray-ml image). Should we revert this back to 2.9, or update the worker image to use the deprecated ray-ml images for now?
This reverts commit 678ec25.
* odh/dev:
  CARRY: Update upstream component_metadata location
  CARRY: Add upstream metadata to Kuberay manifests
  PATCH: CVE fix - Upgrade golang.org/x/net from 0.26.0 to 0.33.0
  CARRY: Updated GO version to 1.22 in odh release workflow
  CARRY: Updated KubeRay image to v1.2.2
  CARRY: Set FS group to MustRunAs for Ray SCC
  PATCH: Raise head pod memory limit to avoid test instability
  PATCH: Add SecurityContext to ray pods to function with restricted pod-security
  CARRY: Add delete patch to remove default namespace (ray-project#16)
  CARRY: Add workflow to release ODH/Kuberay with compiled test binaries
  PATCH: add aggregator role for admin and editor
  PATCH: CVE fix - Replace go-sqlite3 version to upgraded version
  PATCH: openshift kustomize overlay for odh operator
  [Telemetry][v1.2.2] Update KUBERAY_VERSION (ray-project#2417)
  [release v1.2.2] Update tags and versions (ray-project#2416)
  Revert "[release] Update Ray image to 2.34.0 (ray-project#2303)" (ray-project#2413) (ray-project#2415)
Why are these changes needed?

test-job.yaml, and add the test for Ray 2.34.0.
ray-ml:2.9.0 to ray-ml:2.34.0. We will revisit it later. The progress is tracked by "Replace references of rayproject/ray-ml image #2292".

Related issue number

Checks