⚡ Up to 4x faster: Data Parallel for vLLM server #3310
Conversation
trl/scripts/vllm_serve.py
Outdated
```
@@ -226,6 +236,45 @@ class ScriptArguments:
    )


def llm_worker(script_args, data_parallel_rank, connection):
```
The main change is that instead of instantiating a single `LLM` in the main process, we now spawn `dp` subprocesses, each responsible for creating its own `LLM` instance, and set up communication between the main process and each subprocess. While this approach may seem a bit more complex, it's necessary because vLLM depends heavily on environment variables and doesn't cope well with multiple `LLM` instances running inside the same process. Spawning separate subprocesses is the only reliable way to isolate and manage multiple `LLM` instances.
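A minimal sketch of this worker-per-rank pattern, for illustration only (this is not the PR's exact code; `spawn_workers` and the argument names are hypothetical stand-ins for the script arguments discussed in this thread):

```python
import multiprocessing as mp


def llm_worker(script_args, data_parallel_rank, connection):
    # Each rank builds its own LLM inside its own process, so vLLM's
    # environment-variable-driven state stays isolated per rank.
    from vllm import LLM  # imported here so the parent process never touches vLLM

    llm = LLM(
        model=script_args.model,
        tensor_parallel_size=script_args.tensor_parallel_size,
    )

    # Serve instructions from the main process until told to shut down.
    while True:
        command = connection.recv()
        if command["type"] == "shutdown":
            break
        if command["type"] == "call":
            # e.g. method="generate", kwargs={"prompts": ..., "sampling_params": ...}
            result = getattr(llm, command["method"])(**command.get("kwargs", {}))
            connection.send(result)


def spawn_workers(script_args):
    # One subprocess and one Pipe per data-parallel rank.
    connections, processes = [], []
    for rank in range(script_args.data_parallel_size):
        parent_conn, child_conn = mp.Pipe()
        proc = mp.Process(target=llm_worker, args=(script_args, rank, child_conn))
        proc.start()
        connections.append(parent_conn)
        processes.append(proc)
    return connections, processes
```

The main process then talks to each rank only through its end of the `Pipe`, which is what the dispatch code in the next hunk relies on.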
trl/scripts/vllm_serve.py
Outdated
```python
for connection, prompts in zip(connections, chunked_prompts):
    kwargs = {"prompts": prompts, "sampling_params": sampling_params}
    connection.send({"type": "call", "method": "generate", "kwargs": kwargs})

# Wait for and collect all results
all_outputs = [connection.recv() for connection in connections]
```
We can't call `.generate` directly anymore, since the `LLM` instances live in subprocesses. Instead, we send a "call" instruction over each connection and then wait for the results.
So here the outputs will not get mixed up as you go over each connection in order?
I think so. I'll check manually though, as this could lead to a silent bug or unwanted behavior if the order gets mixed up.
If the output gets mixed up, then we can use `from collections import OrderedDict` to preserve the prompt-output order!
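A tiny hypothetical sketch of that suggestion, assuming the same `connections` list as in the hunk above: keying the received outputs by data-parallel rank makes the prompt-output pairing explicit, regardless of the order in which results are read back.

```python
from collections import OrderedDict

# Hypothetical sketch: collect each rank's outputs under its rank index so the
# chunked prompts can be re-associated with their results deterministically.
outputs_by_rank = OrderedDict()
for rank, connection in enumerate(connections):
    outputs_by_rank[rank] = connection.recv()

# Flatten back into a single list in rank (and therefore prompt-chunk) order.
all_outputs = [output for outputs in outputs_by_rank.values() for output in outputs]
```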
@qgallouedec Did you get a chance to check this? Are we getting the prompt-responses aligned with the connections?
Not yet
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Very clean implementation! LGTM with some nits and a question about whether we can support CUDA graphs when DP=1 and TP>1
Hello, thank you very much for TRL's support for vLLM DP, which is exactly what I've been looking forward to and needing. It has greatly accelerated my experiments. However, I encountered an issue when running vllm_serve:

```
NCCL_DEBUG=WARN python -m trl.cli vllm-serve \
    --model /mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct \
    --tensor_parallel_size 1 \
    --data_parallel_size 8 \
    --host 0.0.0.0 \
    --port 6004
```

This resulted in an error. After searching for the cause for a long time, I finally discovered that deleting all TRL-related code from the vllm_serve.py file allows it to work normally, specifically:

```python
# from trl import TrlParser
# from trl.import_utils import (
#     is_fastapi_available,
#     is_pydantic_available,
#     is_uvicorn_available,
#     is_vllm_available,
# )

# if is_fastapi_available():
#     from fastapi import FastAPI

# if is_pydantic_available():
#     from pydantic import BaseModel

# if is_uvicorn_available():
#     import uvicorn

# if is_vllm_available():
#     from vllm import LLM, SamplingParams
#     from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
#     from vllm.distributed.parallel_state import get_world_group
#     from vllm.distributed.utils import StatelessProcessGroup
#     from vllm.sampling_params import GuidedDecodingParams
#     from vllm.utils import get_open_port

# copy class TrlParser(HfArgumentParser): to there
...
```

and then running

```
NCCL_DEBUG=WARN python vllm_serve.py \
    --model /mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct \
    --tensor_parallel_size 1 \
    --data_parallel_size 8 \
    --host 0.0.0.0 \
    --port 6004
```

works correctly. I wonder if the error might be that importing TRL places certain processes on the cuda:0 device, which causes this error? Could you please help look into it? Thank you for your assistance. The complete error log:
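As a hedged aside (not part of the original report): one quick way to test the import hypothesis above is to check whether merely importing trl initializes CUDA in the parent process, before any data-parallel worker is spawned.

```python
# Hedged diagnostic sketch: if the second print shows True, some import side
# effect has already touched the GPU in the parent process.
import torch

print("before importing trl:", torch.cuda.is_initialized())

import trl  # noqa: E402,F401

print("after importing trl:", torch.cuda.is_initialized())
```

If the second print returns True, an import side effect has already claimed a CUDA device in the parent process, which would be consistent with the hypothesis.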
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Unfortunately, I don't believe this feature works properly! I am not able to run anything with DP>1, as I get this weird error (log for 2 nodes with DeepSeek-R1-Distill-Qwen-7B, using trl==0.17.0 and vllm==0.8.3):

```
(EngineCore_1 pid=2353892) INFO 04-25 16:14:42 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
batch-block7-00733:2353893:2353893 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device f000
batch-block7-00733:2353892:2353892 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device f000
batch-block7-00733:2353883:2353883 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 2 and rank 0 both on CUDA device f000
ERROR: Application startup failed. Exiting.
```
What command do you use?
To run the vLLM part, I use this:

The above works fine if DP=1 and TP is set properly.
It could be related to this: Lines 100 to 106 in 29c5e05
Try to replace
Thanks @qgallouedec! Just tested it with TP=1 and DP=8 and it works!
@qgallouedec Unfortunately, the issue seems to persist despite seemingly being resolved at first. This time we use TP=1 and DP=8. I'd appreciate any insights you may have here:

It seems like we do indeed finish several completions before running into this weird error!
Are you using a modified version of GRPO?
Yes! But it works without any issues with the previous version, which is basically TP=1 and DP=1.
Try adding this line:

```diff
  self.vllm_client = VLLMClient(args.vllm_server_host, args.vllm_server_port, connection_timeout=args.vllm_server_timeout)
+ self.vllm_client.init_communicator()
```
Thanks @qgallouedec! I kicked off another training run with TP=1 and DP=8 and haven't noticed any issues so far. Hopefully the issue is resolved. Thanks again for your amazing work!
Usage:
For the client: nothing changes:
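The concrete usage snippets are truncated above. As a hedged sketch based on the commands and `VLLMClient` calls quoted earlier in this thread (the host, port, timeout, and prompt values below are placeholder assumptions), the client-side code keeps using the same interface:

```python
# Hedged sketch of unchanged client-side usage; the server is assumed to have
# been started separately, e.g.:
#   python -m trl.cli vllm-serve --model <model> --tensor_parallel_size 1 --data_parallel_size 8
from trl.extras.vllm_client import VLLMClient

client = VLLMClient("0.0.0.0", 8000, connection_timeout=120)
client.init_communicator()  # per the fix suggested earlier in this thread

# Prompts are sent to the server exactly as before data parallelism was added.
outputs = client.generate(["The capital of France is"])
print(outputs)
```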