Closed
ray-project/ray
#47721
Labels: P0 (critical issue that should be fixed ASAP), bug (something isn't working), ray
Description
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
As part of Ray scalability testing (see https://github.com/ray-project/kuberay/tree/master/benchmark/perf-tests), I've noticed that some Ray pods experience high load and become unready.
Upon further investigation, it seems that most of the CPU usage in the pods comes from the exec probes.
Here's one example from a test. The Ray cluster is idle with no jobs, but each pod is using nearly all of its 2 allocated CPUs:
$ kubectl top pod
test-pkn9h-raycluster-5bqn8-head-4bgwf 1877m 502Mi
test-pkn9h-raycluster-5bqn8-worker-small-group-q4jsq 1897m 326Mi
test-pkn9h-raycluster-5bqn8-worker-small-group-wjhsw 1920m 331Mi
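To put the numbers above in proportion, here is a minimal sketch that parses `kubectl top pod` rows and expresses each pod's CPU usage as a fraction of its allocation. The helper names (`parse_kubectl_top`, `cpu_fraction`) are hypothetical and not part of KubeRay or kubectl. The 2000m default assumes the 2-CPU allocation mentioned above.

```python
# Rows copied from the `kubectl top pod` output in this report.
SAMPLE = """\
test-pkn9h-raycluster-5bqn8-head-4bgwf 1877m 502Mi
test-pkn9h-raycluster-5bqn8-worker-small-group-q4jsq 1897m 326Mi
test-pkn9h-raycluster-5bqn8-worker-small-group-wjhsw 1920m 331Mi
"""


def parse_kubectl_top(output: str) -> dict[str, int]:
    """Map pod name to CPU usage in millicores (hypothetical helper)."""
    usage = {}
    for line in output.strip().splitlines():
        parts = line.split()
        # Keep only rows shaped like "<name> <cpu>m <memory>"; this also
        # skips a header row if one is present.
        if len(parts) != 3 or not parts[1].endswith("m"):
            continue
        millicores = parts[1][:-1]
        if not millicores.isdigit():
            continue
        usage[parts[0]] = int(millicores)
    return usage


def cpu_fraction(usage: dict[str, int], limit_millicores: int = 2000) -> dict[str, float]:
    """Fraction of the CPU allocation in use; 2000m is an assumption (2 CPUs)."""
    return {pod: cpu / limit_millicores for pod, cpu in usage.items()}


if __name__ == "__main__":
    for pod, fraction in cpu_fraction(parse_kubectl_top(SAMPLE)).items():
        print(f"{pod}: {fraction:.0%} of allocation")
```

With the rows above, every pod comes out above 90% of its allocation even though the cluster is idle.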
When I exec into one of the pods and run 'htop', I see that the exec probes are what's consuming most of the CPU:
Describing the pod shows the probes have been timing out:
Normal Created 2m11s kubelet Created container ray-workers
Normal Started 2m10s kubelet Started container ray-workers
Warning Unhealthy 38s (x12 over 110s) kubelet Readiness probe errored: command timed out: "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success" timed out after 1s
Warning Unhealthy 31s (x10 over 90s) kubelet Liveness probe errored: command timed out: "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success" timed out after 1s
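One detail visible in these events: the probe command passes `-T 2` to wget, while the kubelet reports the command timing out after 1s, so the kubelet gives up before wget's own timeout could fire. The sketch below only extracts and compares those two values from an event message. It is an observation aid, not a diagnosis of the CPU usage, and the function and pattern names are hypothetical.

```python
import re

# Matches the kubelet event text shown above (pattern is an assumption
# based on these two events only).
PROBE_EVENT = re.compile(
    r'(?P<kind>Readiness|Liveness) probe errored: command timed out: '
    r'"(?P<command>.*)" timed out after (?P<timeout>\d+)s'
)
# Extracts the value passed to wget's -T (timeout, in seconds) option.
WGET_TIMEOUT = re.compile(r'wget\b.*?-T\s+(?P<seconds>\d+)')


def summarize_probe_timeout(event: str):
    """Return probe kind, kubelet timeout, and wget timeout, or None if no match."""
    match = PROBE_EVENT.search(event)
    if match is None:
        return None
    wget = WGET_TIMEOUT.search(match["command"])
    return {
        "kind": match["kind"],
        "probe_timeout_s": int(match["timeout"]),
        "wget_timeout_s": int(wget["seconds"]) if wget else None,
    }


if __name__ == "__main__":
    event = (
        'Readiness probe errored: command timed out: "bash -c wget -T 2 -q -O- '
        'http://localhost:52365/api/local_raylet_healthz | grep success" '
        'timed out after 1s'
    )
    print(summarize_probe_timeout(event))
```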
Reproduction script
I haven't figured out exact steps to reproduce the issue, but it happens quite frequently in the new scalability tests.
Anything else
Potentially related to #2264
Are you willing to submit a PR?
- Yes I am willing to submit a PR!