
[Core] thread creation error, even with environment variables all set to 1 #54225

@HarryYu1

Description

What happened + What you expected to happen

This may be similar to /issues/36936, but the way the total thread count behaves across repeated ray.init() calls, and how sensitive the failure is to changing any of the thread-related environment variables (despite a very high ulimit), seem worth revisiting.

What happened:

When starting a ray cluster on a single compute node with:

ray start --head --node-ip-address=$(hostname) --temp-dir=/tmp/ray

And, for a multi-node cluster, on each of the worker nodes:

ray start --worker --node-ip-address=$(hostname) --address=HEAD_IP_ADDRESS

And after the cluster starts successfully,

calling ray.init(address="auto") in a Python script, or similarly running "vllm serve", causes hanging/crashing.

Upon looking at the /tmp/ray/session_latest/logs/ directory, I can find this stack trace in each of the worker .err files.

[2025-06-27 19:28:31,558 E 927366 929116] logging.cc:112: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]

[2025-06-27 19:28:31,872 E 927366 929116] logging.cc:119: Stack trace: 

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x1419d1a) [0x154b5b055d1a] ray::operator<<()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x141d2f2) [0x154b5b0592f2] ray::TerminateHandler()

(path_to_my_conda_env)/bin/../lib/libstdc++.so.6(+0xb643c) [0x154b5ce9943c] __cxxabiv1::__terminate()

(path_to_my_conda_env)/bin/../lib/libstdc++.so.6(+0xb648e) [0x154b5ce9948e] __cxxabiv1::__unexpected()

(path_to_my_conda_env)/bin/../lib/libstdc++.so.6(__cxa_rethrow+0) [0x154b5ce99680] __cxa_rethrow

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x70b42e) [0x154b5a34742e] boost::throw_exception<>()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f8f6b) [0x154b5b034f6b] boost::asio::detail::do_throw_error()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f998b) [0x154b5b03598b] boost::asio::detail::posix_thread::start_thread()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f9dec) [0x154b5b035dec] boost::asio::thread_pool::thread_pool()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd228f4) [0x154b5a95e8f4] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()


(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x154b5a95e989] ray::rpc::GetServerCallExecutor()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyELNS6_8AuthTypeE0EE17HandleRequestImplEbEUlS1_S4_S4_E0_E9_M_invokeERKSt9_Any_dataOS1_OS4_SJ_+0x12b) [0x154b5a59124b] std::_Function_handler<>::_M_invoke()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x839) [0x154b5a5c3789] ray::core::CoreWorker::HandleGetCoreWorkerStats()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ServerCallImplINS0_24CoreWorkerServiceHandlerENS0_25GetCoreWorkerStatsRequestENS0_23GetCoreWorkerStatsReplyELNS0_8AuthTypeE0EE17HandleRequestImplEb+0x104) [0x154b5a5a9744] ray::rpc::ServerCallImpl<>::HandleRequestImpl()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd9d968) [0x154b5a9d9968] EventTracker::RecordExecution()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd4892e) [0x154b5a98492e] std::_Function_handler<>::_M_invoke()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd48da6) [0x154b5a984da6] boost::asio::detail::completion_handler<>::do_complete()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f65fb) [0x154b5b0325fb] boost::asio::detail::scheduler::do_run_one()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f7f79) [0x154b5b033f79] boost::asio::detail::scheduler::run()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f8682) [0x154b5b034682] boost::asio::io_context::run()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x154b5a520b61] ray::core::CoreWorker::RunIOService()

(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xcbc550) [0x154b5a8f8550] thread_proxy

/lib64/libpthread.so.0(+0xa6ea) [0x154b5d8406ea] start_thread

/lib64/libc.so.6(clone+0x41) [0x154b5d60050f] clone

What I’ve Tried:

Setting these environment variables to the following values:

export RAY_gcs_server_rpc_server_thread_num=1
export RAY_gcs_server_rpc_client_thread_num=1
export RAY_num_server_call_thread=1
export OMP_NUM_THREADS=1

After these environment variables are set, a very simple Python script that just calls ray.init(address="auto") ran without crashing or hanging, but larger programs like "vllm serve" still could not run successfully on the first attempt.
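
For reference, here is a minimal sketch of that simple case (assuming the cluster above is already running and the variables are exported in the shell that started it; re-setting them via os.environ in the driver is shown only for completeness and may not affect the already-running daemons):

import os

# Same variables as exported above; whether re-setting them in the driver
# process affects the already-running raylet/GCS daemons is an assumption.
os.environ.setdefault("RAY_gcs_server_rpc_server_thread_num", "1")
os.environ.setdefault("RAY_gcs_server_rpc_client_thread_num", "1")
os.environ.setdefault("RAY_num_server_call_thread", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")

import ray

ray.init(address="auto")          # connect to the running cluster
print(ray.cluster_resources())    # sanity check that the connection works
ray.shutdown()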

Another thing I've noticed: once the first ray.init(address="auto") on the cluster has either run or failed, the total number of system threads, measured with

ps -eo nlwp | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }'

decreases with each subsequent ray.init(address="auto") call.

For example, here are some values that I found with the above variables set:

Right after the cluster started: 1917 threads
1st run of ray.init(address="auto"): 4617 threads
2nd run: 4205 threads
3rd run: 3685 threads
4th run: 3161 threads
5th run: 2641 threads

Eventually, this bottoms out at the original number of threads that were running on a fresh cluster.
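
Roughly, the measurements above correspond to a loop like the following (a sketch only; in my tests each ray.init() actually came from a fresh Python process, and total_threads is a hypothetical helper wrapping the same ps pipeline):

import subprocess

import ray

# Same pipeline as above: sum the per-process thread counts (nlwp).
PS_CMD = "ps -eo nlwp | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }'"

def total_threads() -> int:
    return int(subprocess.check_output(PS_CMD, shell=True, text=True).strip())

print("fresh cluster:", total_threads())
for i in range(1, 6):
    try:
        ray.init(address="auto")
    except Exception as exc:
        # The first call(s) can die with the thread creation error above.
        print(f"run {i}: ray.init failed: {exc}")
    finally:
        if ray.is_initialized():
            ray.shutdown()
    print(f"after run {i}:", total_threads())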

Using this observation, I was able to get "vllm serve" to run by first reducing the total number of threads with a couple of ray.init(address="auto") calls beforehand.

However, the performance was significantly slower than expected, likely due to the severely limited number of threads.

When I relax the thread limits even slightly, such as by changing one of the variables from 1 to 2, the crashes happen more frequently (i.e., the first N+x ray.init() calls fail instead of the first N). If I don't limit the threads at all, the ray.init() call hangs indefinitely, does not respond to SIGINT, and resists my previous hacky strategy of reducing the total thread count by calling ray.init() several times.

I’ve checked the maximum number of user processes (ulimit -u), which is 2060471, and the stack size, which is unlimited, so it is strange that I am hitting a thread creation error.
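
For completeness, the limits were checked roughly like this (a sketch; the kernel.threads-max read is my addition, since the system-wide thread ceiling can also produce EAGAIN on thread creation):

import resource

# Per-user limits reported above: ulimit -u = 2060471, stack size unlimited.
for label, rlim in [("max user processes (ulimit -u)", resource.RLIMIT_NPROC),
                    ("stack size (ulimit -s)", resource.RLIMIT_STACK)]:
    soft, hard = resource.getrlimit(rlim)
    print(label, "soft:", soft, "hard:", hard)

# System-wide thread ceiling; exceeding it also yields EAGAIN
# ("Resource temporarily unavailable") from pthread_create.
with open("/proc/sys/kernel/threads-max") as f:
    print("kernel.threads-max:", f.read().strip())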

What I Expected to Happen:

"vllm serve", and other programs that call ray.init(), should be able to start without hitting thread creation errors, or at least run with the number of threads capped at a level that still gives reasonable performance.

Versions / Dependencies

Python 3.11.8
Ray 2.46.0

OS / Hardware:

SUSE Linux Enterprise Server 15 SP5
Argonne Polaris compute node:
1 AMD EPYC "Milan" processor (64 cores)
4 NVIDIA A100 GPUs
Unified Memory Architecture
2 fabric endpoints
2 NVMe SSDs

Reproduction script

On head node:

ray start --head

python
import ray
ray.init(address="auto")

Issue Severity

High: It blocks me from completing my task.


Labels

bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), question (Just a question :)), stability, triage (Needs triage (eg: priority, bug/not-bug, and owning component))
