Description
What happened + What you expected to happen
This may be similar to /issues/36936, but I think the behavior of the number of threads across repeated ray.init() calls, and the sensitivity to changing any of the thread environment variables (under a very high ulimit), make this worth revisiting.
What happened:
I start a Ray cluster on a single compute node with:
ray start --head --node-ip-address=$(hostname) --temp-dir=/tmp/ray
and, for a multi-node cluster, on each of the worker nodes with:
ray start --worker --node-ip-address=$(hostname) --address=HEAD_IP_ADDRESS
After the cluster starts successfully, calling ray.init(address="auto") in a Python script (and similarly trying to use "vllm serve") causes hanging/crashing.
Upon looking at the /tmp/ray/session_latest/logs/ directory, I can find this stack trace in each of the worker .err files.
[2025-06-27 19:28:31,558 E 927366 929116] logging.cc:112: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
[2025-06-27 19:28:31,872 E 927366 929116] logging.cc:119: Stack trace:
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x1419d1a) [0x154b5b055d1a] ray::operator<<()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x141d2f2) [0x154b5b0592f2] ray::TerminateHandler()
(path_to_my_conda_env)/bin/../lib/libstdc++.so.6(+0xb643c) [0x154b5ce9943c] __cxxabiv1::__terminate()
(path_to_my_conda_env)/bin/../lib/libstdc++.so.6(+0xb648e) [0x154b5ce9948e] __cxxabiv1::__unexpected()
(path_to_my_conda_env)/bin/../lib/libstdc++.so.6(__cxa_rethrow+0) [0x154b5ce99680] __cxa_rethrow
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x70b42e) [0x154b5a34742e] boost::throw_exception<>()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f8f6b) [0x154b5b034f6b] boost::asio::detail::do_throw_error()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f998b) [0x154b5b03598b] boost::asio::detail::posix_thread::start_thread()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f9dec) [0x154b5b035dec] boost::asio::thread_pool::thread_pool()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd228f4) [0x154b5a95e8f4] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x154b5a95e989] ray::rpc::GetServerCallExecutor()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyELNS6_8AuthTypeE0EE17HandleRequestImplEbEUlS1_S4_S4_E0_E9_M_invokeERKSt9_Any_dataOS1_OS4_SJ_+0x12b) [0x154b5a59124b] std::_Function_handler<>::_M_invoke()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x839) [0x154b5a5c3789] ray::core::CoreWorker::HandleGetCoreWorkerStats()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ServerCallImplINS0_24CoreWorkerServiceHandlerENS0_25GetCoreWorkerStatsRequestENS0_23GetCoreWorkerStatsReplyELNS0_8AuthTypeE0EE17HandleRequestImplEb+0x104) [0x154b5a5a9744] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd9d968) [0x154b5a9d9968] EventTracker::RecordExecution()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd4892e) [0x154b5a98492e] std::_Function_handler<>::_M_invoke()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xd48da6) [0x154b5a984da6] boost::asio::detail::completion_handler<>::do_complete()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f65fb) [0x154b5b0325fb] boost::asio::detail::scheduler::do_run_one()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f7f79) [0x154b5b033f79] boost::asio::detail::scheduler::run()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0x13f8682) [0x154b5b034682] boost::asio::io_context::run()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x154b5a520b61] ray::core::CoreWorker::RunIOService()
(path_to_my_conda_env)/lib/python3.11/site-packages/ray/_raylet.so(+0xcbc550) [0x154b5a8f8550] thread_proxy
/lib64/libpthread.so.0(+0xa6ea) [0x154b5d8406ea] start_thread
/lib64/libc.so.6(clone+0x41) [0x154b5d60050f] clone
What I’ve Tried:
Setting these environment variables to the following values:
export RAY_gcs_server_rpc_server_thread_num=1
export RAY_gcs_server_rpc_client_thread_num=1
export RAY_num_server_call_thread=1
export OMP_NUM_THREADS=1
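(For completeness, here is roughly the equivalent from inside a driver script, as a minimal sketch. I am assuming the RAY_* variables are read when each Ray process starts, so in practice I export them in the shell on every node before ray start; setting them in the driver like this presumably only affects the driver's own process.)

import os

# Assumption: these must be in the environment before Ray is imported/initialized;
# exporting them in the shell before `ray start` is what I actually do on the nodes.
os.environ.setdefault("RAY_gcs_server_rpc_server_thread_num", "1")
os.environ.setdefault("RAY_gcs_server_rpc_client_thread_num", "1")
os.environ.setdefault("RAY_num_server_call_thread", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")

import ray
ray.init(address="auto")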
After setting these environment variables, a very simple Python script that just calls ray.init(address="auto") ran without crashing or hanging, but larger programs like "vllm serve" still could not run successfully on the first attempt.
Another thing I've noticed: once the first ray.init(address="auto") on the cluster has run (or failed), the total number of system threads, measured with
ps -eo nlwp | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }'
decreases with each subsequent ray.init(address="auto") call.
For example, here are some values I measured with the above variables set:
Right after the cluster started: 1917 threads
After 1 run of ray.init(address="auto"): 4617 threads
After 2 runs: 4205
After 3 runs: 3685
After 4 runs: 3161
After 5 runs: 2641
Eventually, this bottoms out at the original number of threads that were running on a fresh cluster.
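For reproducibility, this is roughly how I measured the counts, as a minimal sketch (I actually ran the one-line driver script repeatedly from the shell, but I am assuming repeated init/shutdown cycles in a single process show the same trend):

import subprocess
import ray

def total_system_threads():
    # Sum the NLWP column over every process, same as the ps one-liner above.
    lines = subprocess.run(
        ["ps", "-eo", "nlwp"], capture_output=True, text=True, check=True
    ).stdout.splitlines()[1:]
    return sum(int(line) for line in lines)

print("fresh cluster:", total_system_threads())
for i in range(5):
    ray.init(address="auto")
    ray.shutdown()
    print(f"after run {i + 1}:", total_system_threads())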
Using this, I was able to get "vllm serve" to run by first reducing the total thread count, i.e. by calling ray.init(address="auto") a couple of times beforehand.
However, performance was significantly slower than expected, likely due to the severely limited number of threads.
When I relax the thread limits even slightly, for example by changing one of the variables from 1 to 2, the crashes happen more frequently (the first N+x ray.init() calls fail instead of the first N). If I don't limit the threads at all, the ray.init() call hangs indefinitely, does not respond to SIGINT, and also resists my previous hacky strategy of reducing the total thread count by calling ray.init() several times.
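(A minimal sketch of how I confirm the hang externally, assuming the one-line driver is launched as a subprocess; the 300-second timeout is an arbitrary value chosen for illustration:)

import subprocess

# Launch the one-line driver and kill it if it does not finish in time.
proc = subprocess.Popen(["python", "-c", 'import ray; ray.init(address="auto")'])
try:
    proc.wait(timeout=300)
    print("ray.init finished, exit code:", proc.returncode)
except subprocess.TimeoutExpired:
    proc.kill()
    print("ray.init still hanging after 300 s; killed the driver")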
I’ve checked the maximum number of user processes (ulimit -u), which is 2060471, and the stack size, which is unlimited, so it is strange that I am hitting a thread creation error.
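For reference, the same limits checked from Python (a minimal sketch using the standard resource module, which I am assuming reports the same limits that ulimit shows in my shell):

import resource

# RLIMIT_NPROC corresponds to `ulimit -u` (max user processes/threads),
# RLIMIT_STACK to `ulimit -s` (stack size).
for name, rlim in [("nproc", resource.RLIMIT_NPROC), ("stack", resource.RLIMIT_STACK)]:
    soft, hard = resource.getrlimit(rlim)
    soft = "unlimited" if soft == resource.RLIM_INFINITY else soft
    hard = "unlimited" if hard == resource.RLIM_INFINITY else hard
    print(f"{name}: soft={soft} hard={hard}")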
What I Expected to Happen:
"vllm serve", and other processes that call ray.init(), should be able to start without encountering thread-creation issues, or at least run with the thread count limited to a number that still allows reasonable speed.
Versions / Dependencies
Python 3.11.8
Ray 2.46.0
OS / Hardware:
SUSE Linux Enterprise Server 15 SP5
Argonne Polaris compute node: 1 AMD EPYC "Milan" processor (64 cores); 4 NVIDIA A100 GPUs; Unified Memory Architecture; 2 fabric endpoints; 2 NVMe SSDs
Reproduction script
On the head node:
ray start --head
python
import ray
ray.init(address="auto")
Issue Severity
High: It blocks me from completing my task.