Description
What happened + What you expected to happen
This may be similar to /issues/36936, but I think the behavior of the number of threads across repeated ray.init() calls, and the sensitivity to changing any of the thread environment variables (under a very high ulimit), make this worth revisiting.
What happened:
I start a Ray cluster on a single compute node with:
ray start --head --node-ip-address=$(hostname) --temp-dir=/tmp/ray
and, for a multi-node cluster, on each of the worker nodes with:
ray start --worker --node-ip-address=$(hostname) --address=HEAD_IP_ADDRESS
After the cluster starts successfully, calling ray.init(address="auto") in a Python script (and similarly trying to use "vllm serve") causes hanging/crashing.
Upon looking at the /tmp/ray/session_latest/logs/ directory, I can find this stack trace in each of the worker .err files.
[2025-06-27 19:28:31,558 E 927366 929116] logging.cc:112: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
[2025-06-27 19:28:31,872 E 927366 929116] logging.cc:119: Stack trace:
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x1419d1a) [0x154b5b055d1a] ray::operator<<()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x141d2f2) [0x154b5b0592f2] ray::TerminateHandler()
(path_to_my_conda_env)/bin/../lib/libstdc++.so.6(+0xb643c) [0x154b5ce9943c] __cxxabiv1::__terminate()
(path_to_my_conda_env)/bin/../lib/libstdc++.so.6(+0xb648e) [0x154b5ce9948e] __cxxabiv1::__unexpected()
(path_to_my_conda_env)/bin/../lib/libstdc++.so.6(__cxa_rethrow+0) [0x154b5ce99680] __cxa_rethrow
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x70b42e) [0x154b5a34742e] boost::throw_exception<>()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f8f6b) [0x154b5b034f6b] boost::asio::detail::do_throw_error()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f998b) [0x154b5b03598b] boost::asio::detail::posix_thread::start_thread()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f9dec) [0x154b5b035dec] boost::asio::thread_pool::thread_pool()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd228f4) [0x154b5a95e8f4] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x154b5a95e989] ray::rpc::GetServerCallExecutor()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyELNS6_8AuthTypeE0EE17HandleRequestImplEbEUlS1_S4_S4_E0_E9_M_invokeERKSt9_Any_dataOS1_OS4_SJ_+0x12b) [0x154b5a59124b] std::_Function_handler<>::_M_invoke()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x839) [0x154b5a5c3789] ray::core::CoreWorker::HandleGetCoreWorkerStats()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ServerCallImplINS0_24CoreWorkerServiceHandlerENS0_25GetCoreWorkerStatsRequestENS0_23GetCoreWorkerStatsReplyELNS0_8AuthTypeE0EE17HandleRequestImplEb+0x104) [0x154b5a5a9744] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd9d968) [0x154b5a9d9968] EventTracker::RecordExecution()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd4892e) [0x154b5a98492e] std::_Function_handler<>::_M_invoke()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd48da6) [0x154b5a984da6] boost::asio::detail::completion_handler<>::do_complete()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f65fb) [0x154b5b0325fb] boost::asio::detail::scheduler::do_run_one()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f7f79) [0x154b5b033f79] boost::asio::detail::scheduler::run()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f8682) [0x154b5b034682] boost::asio::io_context::run()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x154b5a520b61] ray::core::CoreWorker::RunIOService()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xcbc550) [0x154b5a8f8550] thread_proxy
/lib64/libpthread.so.0(+0xa6ea) [0x154b5d8406ea] start_thread
/lib64/libc.so.6(clone+0x41) [0x154b5d60050f] clone
What I’ve Tried:
Setting these environment variables to the following values:
export RAY_gcs_server_rpc_server_thread_num=1
export RAY_gcs_server_rpc_client_thread_num=1
export RAY_num_server_call_thread=1
export OMP_NUM_THREADS=1
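(For completeness, here is roughly the equivalent from inside a driver script, as a minimal sketch. I am assuming the RAY_* variables are read when each Ray process starts, so in practice I export them in the shell on every node before ray start; setting them in the driver like this presumably only affects the driver's own process.)

import os

# Assumption: these must be in the environment before Ray is imported/initialized;
# exporting them in the shell before `ray start` is what I actually do on the nodes.
os.environ.setdefault("RAY_gcs_server_rpc_server_thread_num", "1")
os.environ.setdefault("RAY_gcs_server_rpc_client_thread_num", "1")
os.environ.setdefault("RAY_num_server_call_thread", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")

import ray
ray.init(address="auto")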
After setting these environment variables, a very simple Python script that just calls ray.init(address="auto") ran without crashing or hanging, but larger programs like "vllm serve" still could not run successfully on the first attempt.
Another thing I've noticed: once the first ray.init(address="auto") on the cluster has run (or failed), the total number of system threads, measured with
ps -eo nlwp | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }'
decreases with each subsequent ray.init(address="auto") call.
For example, here are some values I measured with the above variables set:
Right after the cluster started: 1917 threads
After 1 run of ray.init(address="auto"): 4617 threads
After 2 runs: 4205
After 3 runs: 3685
After 4 runs: 3161
After 5 runs: 2641
Eventually, this bottoms out at the original number of threads that were running on a fresh cluster.
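For reproducibility, this is roughly how I measured the counts, as a minimal sketch (I actually ran the one-line driver script repeatedly from the shell, but I am assuming repeated init/shutdown cycles in a single process show the same trend):

import subprocess
import ray

def total_system_threads():
    # Sum the NLWP column over every process, same as the ps one-liner above.
    lines = subprocess.run(
        ["ps", "-eo", "nlwp"], capture_output=True, text=True, check=True
    ).stdout.splitlines()[1:]
    return sum(int(line) for line in lines)

print("fresh cluster:", total_system_threads())
for i in range(5):
    ray.init(address="auto")
    ray.shutdown()
    print(f"after run {i + 1}:", total_system_threads())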
Using this, I was able to get "vllm serve" to run by first reducing the total thread count, i.e. by calling ray.init(address="auto") a couple of times beforehand.
However, performance was significantly slower than expected, likely due to the severely limited number of threads.
When I relax the thread limits even slightly, for example by changing one of the variables from 1 to 2, the crashes happen more frequently (the first N+x ray.init() calls fail instead of the first N). If I don't limit the threads at all, the ray.init() call hangs indefinitely, does not respond to SIGINT, and also resists my previous hacky strategy of reducing the total thread count by calling ray.init() several times.
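(A minimal sketch of how I confirm the hang externally, assuming the one-line driver is launched as a subprocess; the 300-second timeout is an arbitrary value chosen for illustration:)

import subprocess

# Launch the one-line driver and kill it if it does not finish in time.
proc = subprocess.Popen(["python", "-c", 'import ray; ray.init(address="auto")'])
try:
    proc.wait(timeout=300)
    print("ray.init finished, exit code:", proc.returncode)
except subprocess.TimeoutExpired:
    proc.kill()
    print("ray.init still hanging after 300 s; killed the driver")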
I’ve checked the maximum number of user processes (ulimit -u), which is 2060471, and the stack size, which is unlimited, so it is strange that I am hitting a thread creation error.
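For reference, the same limits checked from Python (a minimal sketch using the standard resource module, which I am assuming reports the same limits that ulimit shows in my shell):

import resource

# RLIMIT_NPROC corresponds to `ulimit -u` (max user processes/threads),
# RLIMIT_STACK to `ulimit -s` (stack size).
for name, rlim in [("nproc", resource.RLIMIT_NPROC), ("stack", resource.RLIMIT_STACK)]:
    soft, hard = resource.getrlimit(rlim)
    soft = "unlimited" if soft == resource.RLIM_INFINITY else soft
    hard = "unlimited" if hard == resource.RLIM_INFINITY else hard
    print(f"{name}: soft={soft} hard={hard}")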
What I Expected to Happen:
"vllm serve", and other processes that call ray.init(), should be able to start without encountering thread-creation issues, or at least run with the thread count limited to a number that still allows reasonable speed.
Versions / Dependencies
Python 3.11.8
Ray 2.46.0
OS / Hardware:
SUSE Linux Enterprise Server 15 SP5
Argonne Polaris compute node: 1 AMD EPYC "Milan" processor (64 cores); 4 NVIDIA A100 GPUs; Unified Memory Architecture; 2 fabric endpoints; 2 NVMe SSDs
Reproduction script
On the head node:
ray start --head
python
import ray
ray.init(address="auto")
Issue Severity
High: It blocks me from completing my task.