[Bug] Exec probes are causing high load on Ray pods #2355

@andrewsykim

Description

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

As part of Ray scalability testing (see https://github.com/ray-project/kuberay/tree/master/benchmark/perf-tests), I've noticed that some Ray pods experience high load and become unready.

Upon further investigation, it seems that most of the CPU usage in the pods comes from the exec probes.

Here's one example from a test. The Ray cluster is idle with no jobs, but each pod is using nearly all of its 2 allocated CPUs:

$ kubectl top pod
test-pkn9h-raycluster-5bqn8-head-4bgwf                              1877m        502Mi
test-pkn9h-raycluster-5bqn8-worker-small-group-q4jsq      1897m        326Mi
test-pkn9h-raycluster-5bqn8-worker-small-group-wjhsw      1920m        331Mi
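To pull the saturated pods out of a longer listing like the one above, a small filter along these lines might help. It is only a sketch: `hot_pods` is a hypothetical helper name, the 1800m threshold is an arbitrary choice, and it assumes the CPU column is reported in millicores as in the output above.

```shell
# hot_pods THRESHOLD_MILLICORES
# Reads `kubectl top pod` rows on stdin and prints the name and CPU value of
# every pod whose CPU column meets or exceeds the threshold.
# (hot_pods is a hypothetical helper, not part of kubectl or KubeRay.)
hot_pods() {
  awk -v limit="$1" '
    $2 ~ /^[0-9]+m$/ {            # skip the header and non-millicore rows
      cpu = $2
      sub(/m$/, "", cpu)          # "1877m" -> "1877"
      if (cpu + 0 >= limit + 0) print $1, $2
    }'
}

# Example (threshold chosen arbitrarily):
#   kubectl top pod | hot_pods 1800
```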

When I exec into one of the pods and run 'htop', I see that the exec probes are consuming most of the CPU:

[Screenshot 2024-09-05 at 12:47:33 PM: htop output from the pod]

Describing the pod shows the probes have been timing out:

  Normal   Created    2m11s                kubelet            Created container ray-workers
  Normal   Started    2m10s                kubelet            Started container ray-workers
  Warning  Unhealthy  38s (x12 over 110s)  kubelet            Readiness probe errored: command timed out: "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success" timed out after 1s
  Warning  Unhealthy  31s (x10 over 90s)   kubelet            Liveness probe errored: command timed out: "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success" timed out after 1s
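One way to check from inside a pod how often the probe command exceeds the 1s limit reported in the events above is to run it repeatedly under a timeout. The following is a rough sketch, not how the kubelet runs probes: `count_timeouts` is a hypothetical helper, and it assumes GNU coreutils `timeout` and `bash` are available in the container image.

```shell
# count_timeouts RUNS LIMIT_SECONDS CMD
# Runs CMD (via bash -c) RUNS times, each under `timeout LIMIT_SECONDS`,
# and reports how many runs were killed for exceeding the limit.
# (count_timeouts is a hypothetical helper; `timeout` exits with 124 when
# the command is killed for running too long.)
count_timeouts() {
  runs="$1"; limit="$2"; cmd="$3"
  timed_out=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    status=0
    timeout "$limit" bash -c "$cmd" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
      timed_out=$((timed_out + 1))
    fi
    i=$((i + 1))
  done
  echo "$timed_out/$runs runs exceeded ${limit}s"
}

# Example, using the probe command from the events above:
#   count_timeouts 20 1 'wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success'
```

Running this by hand adds its own process-spawning load, so the numbers are only indicative.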

Reproduction script

I haven't figured out the exact steps to reproduce the issue, but it happens quite frequently in the new scalability tests.

Anything else

Potentially related to #2264

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

P0 (Critical issue that should be fixed ASAP), bug (Something isn't working), ray
