
Conversation

kevin85421
Member

@kevin85421 kevin85421 commented Aug 12, 2024

Why are these changes needed?

  1. Remove the compatibility tests for Ray 2.7.0 from test-job.yaml, and add the test for Ray 2.34.0.
  2. Change almost all images from Ray 2.9.0 to Ray 2.34.0. However, I didn't change ray-ml:2.9.0 to ray-ml:2.34.0; we will revisit that later. Progress is tracked by #2292 (Replace references of rayproject/ray-ml image).
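A bulk tag bump like this is usually scripted rather than edited by hand. A minimal sketch (the helper name and regex are my own illustration, not from this PR) that bumps `rayproject/ray` tags while leaving `rayproject/ray-ml` untouched:

```python
import re

def bump_ray_images(text: str, old: str = "2.9.0", new: str = "2.34.0") -> str:
    """Bump rayproject/ray:<old> tags to <new>, leaving rayproject/ray-ml as-is."""
    # "rayproject/ray:" cannot match "rayproject/ray-ml:" because the literal
    # colon in the pattern rules out the "-ml" suffix.
    return re.sub(rf"(rayproject/ray:){re.escape(old)}", rf"\g<1>{new}", text)

print(bump_ray_images("image: rayproject/ray:2.9.0"))     # bumped to 2.34.0
print(bump_ray_images("image: rayproject/ray-ml:2.9.0"))  # left unchanged
```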

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@kevin85421 kevin85421 changed the title from "Update Ray image to 2.34.0" to "[release] Update Ray image to 2.34.0" on Aug 12, 2024
@@ -85,7 +85,7 @@ spec:
     spec:
       containers:
       - name: ray-worker
-        image: rayproject/ray-ml:2.4.0
+        image: rayproject/ray-ml:2.9.0
Member

Should we update this to use 2.34.0.fc8721 or 2.34.0.deprecated?

Signed-off-by: kaihsun <kaihsun@anyscale.com>
@@ -312,13 +312,13 @@ jobs:
      run: go test ./pkg/... -race -parallel 4
      working-directory: ${{env.working-directory}}

-  test-compatibility-2_7_0:

@@ -38,7 +38,7 @@ applications:
    args:
      num_forwards: 0
    runtime_env:
-     working_dir: https://github.com/ray-project/serve_workloads/archive/a2e2405f3117f1b4134b6924b5f44c4ff0710c00.zip
+     working_dir: https://github.com/ray-project/serve_workloads/archive/a9f184f4d9ddb7f9a578502ae106470f87a702ef.zip
Member Author

Ray Serve introduced some breaking changes after Ray 2.9.0; I opened ray-project/serve_workloads#4 to fix the issue.

   autoscaling_config:
     metrics_interval_s: 0.2
     min_replicas: 1
     max_replicas: 14
     look_back_period_s: 2
     downscale_delay_s: 5
     upscale_delay_s: 2
-    target_num_ongoing_requests_per_replica: 1
+    target_ongoing_requests: 1
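The rename above reflects Serve's newer autoscaling field names. A hedged sketch of a migration helper (my own illustration, not part of Ray) mapping the old keys seen in this PR's diffs to their newer equivalents:

```python
# Old Serve config keys and the newer names used in recent Ray releases
# (the pairs below are the ones appearing in this PR's diffs).
SERVE_KEY_RENAMES = {
    "target_num_ongoing_requests_per_replica": "target_ongoing_requests",
    "max_concurrent_queries": "max_ongoing_requests",
}

def migrate_serve_config(cfg: dict) -> dict:
    """Return a copy of cfg with deprecated Serve keys renamed."""
    return {SERVE_KEY_RENAMES.get(k, k): v for k, v in cfg.items()}

migrate_serve_config({"target_num_ongoing_requests_per_replica": 1, "min_replicas": 1})
```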

   graceful_shutdown_timeout_s: 5
-  max_concurrent_queries: 1000

@@ -285,6 +285,10 @@ def test_zero_downtime_rollout(self, set_up_cluster):
         for cr_event in cr_events:
             cr_event.trigger()

+    @pytest.mark.skip(
Member Author

Ray Serve introduced some breaking changes (see ray-service.autoscaler.yaml and rayservice_ha_test.go), and the autoscaling test logic no longer works.

(I haven't read the code; this is just my observation and guess, based on the very limited information I can get from the proxy actor's logs.) Ray Serve appears to queue pending requests instead of failing them, and the queued pending requests prevent the request to "signal" from being scheduled. When I set max_ongoing_requests to a small number (e.g. 1), the request to "signal" can be scheduled correctly and unblock some "block" requests. However, some pending requests are then scheduled to the replicas immediately, so the replicas can't be scaled down.

https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/serve/_private/replica_scheduler/pow_2_scheduler.py?L701
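The guessed behavior above can be sketched with a toy model (my own illustration of the observation, not Ray's actual pow_2_scheduler logic): requests beyond max_ongoing_requests queue instead of failing, and as soon as a slot frees, a queued request refills it, so the ongoing count never drops and the replica never becomes idle enough to scale down.

```python
from collections import deque

def schedule(num_replicas: int, max_ongoing: int, requests: list):
    """Toy model: assign each request to the least-loaded replica; overflow queues."""
    ongoing = [0] * num_replicas
    queue = deque(requests)
    while queue:
        # Stand-in for the power-of-two choice: pick the least-loaded replica.
        r = min(range(num_replicas), key=ongoing.__getitem__)
        if ongoing[r] >= max_ongoing:
            break  # every replica is saturated; the rest stay queued
        ongoing[r] += 1
        queue.popleft()
    return ongoing, list(queue)

# 20 "block" requests, 2 replicas, max_ongoing_requests=1:
ongoing, queued = schedule(2, 1, [f"block-{i}" for i in range(20)])
# Only 2 requests run; 18 wait in the queue. Whenever a slot frees, a queued
# request takes it immediately, so the replicas stay busy and can't scale down.
```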

Signed-off-by: kaihsun <kaihsun@anyscale.com>
   ray_actor_options:
     num_cpus: 0.5
 - name: signal
   import_path: autoscaling.signaling:app
   route_prefix: /signal
   deployments:
   - name: SignalDeployment
+    max_ongoing_requests: 1000
Member Author

Each "block" request also sends a request to the signal replica, so max_ongoing_requests should be more than 20 (the number of block requests). The default value changed from 100 to 5 in Ray 2.32.0.
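The sizing argument above is simple arithmetic; a small sketch (the numbers come from the comment, the check itself is illustrative):

```python
NUM_BLOCK_REQUESTS = 20          # concurrent "block" requests in the test
DEFAULT_MAX_ONGOING_PRE_232 = 100
DEFAULT_MAX_ONGOING_232 = 5      # new default per the comment above

def signal_replica_has_capacity(max_ongoing: int) -> bool:
    # Each block request holds an ongoing request on the signal replica,
    # so at least one extra slot beyond the 20 is needed for the request
    # that actually sends the signal.
    return max_ongoing > NUM_BLOCK_REQUESTS

print(signal_replica_has_capacity(DEFAULT_MAX_ONGOING_PRE_232))  # True
print(signal_replica_has_capacity(DEFAULT_MAX_ONGOING_232))      # False: test hangs
print(signal_replica_has_capacity(1000))                         # True: the value set here
```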

@kevin85421 kevin85421 marked this pull request as ready for review August 14, 2024 01:15
@kevin85421 kevin85421 merged commit 678ec25 into ray-project:master Aug 14, 2024
27 checks passed
@@ -31,7 +31,7 @@ spec:
       serviceAccountName: pytorch-distributed-training
       containers:
       - name: ray-head
-        image: rayproject/ray:2.9.0
+        image: rayproject/ray:2.34.0
Member

This example no longer works because the head node is on 2.34 while the worker node is on 2.9 (using the ray-ml image). Should we revert this back to 2.9, or update the worker image to use the deprecated ray-ml images for now?
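Either way, the point is that head and worker must run the same Ray version. A sketch of the second option (the field layout follows typical KubeRay RayCluster examples; the `2.34.0.deprecated` ray-ml tag is the one floated earlier in this thread and should be verified to exist):

```yaml
# Sketch only: pin head and worker to the same Ray version.
headGroupSpec:
  template:
    spec:
      containers:
      - name: ray-head
        image: rayproject/ray:2.34.0
workerGroupSpecs:
- groupName: worker
  template:
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray-ml:2.34.0.deprecated  # hypothetical tag from the discussion above
```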

kevin85421 added a commit to kevin85421/kuberay that referenced this pull request Sep 29, 2024
kevin85421 added a commit that referenced this pull request Sep 29, 2024
kevin85421 added a commit that referenced this pull request Sep 29, 2024
kryanbeane added a commit to kryanbeane/kuberay that referenced this pull request Apr 8, 2025
* odh/dev:
  CARRY: Update upstream component_metadata location
  CARRY: Add upstream metadata to Kuberay manifests
  PATCH: CVE fix - Upgrade golang.org/x/net from 0.26.0 to 0.33.0
  CARRY: Updated GO version to 1.22 in odh release workflow
  CARRY: Updated KubeRay image to v1.2.2
  CARRY: Set FS group to MustRunAs for Ray SCC
  PATCH: Raise head pod memory limit to avoid test instability
  PATCH: Add SecurityContext to ray pods to function with restricted pod-security
  CARRY: Add delete patch to remove default namespace (ray-project#16)
  CARRY: Add workflow to release ODH/Kuberay with compiled test binaries
  PATCH: add aggregator role for admin and editor
  PATCH: CVE fix - Replace go-sqlite3 version to upgraded version
  PATCH: openshift kustomize overlay for odh operator
  [Telemetry][v1.2.2] Update KUBERAY_VERSION (ray-project#2417)
  [release v1.2.2] Update tags and versions (ray-project#2416)
  Revert "[release] Update Ray image to 2.34.0 (ray-project#2303)" (ray-project#2413) (ray-project#2415)
3 participants