
Conversation

qgallouedec (Member) commented Apr 16, 2025

Usage:

trl vllm-serve --model Qwen/Qwen2.5-1.5B --data_parallel_size 2 --tensor_parallel_size 2

For the client, nothing changes:

# demo_client.py
from trl.extras.vllm_client import VLLMClient
client = VLLMClient(connection_timeout=30)

# Generate
print(client.generate(["Hello, AI!", "Tell me a joke"] * 20))

# Transfer the model to the client
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B", device_map="cuda")
client.update_model_params(model)

Run the client with:

CUDA_VISIBLE_DEVICES=4 python demo_client.py

qgallouedec marked this pull request as ready for review April 17, 2025 05:18
@@ -226,6 +236,45 @@ class ScriptArguments:
)


def llm_worker(script_args, data_parallel_rank, connection):
qgallouedec (Member Author) commented:

The main change is that instead of instantiating a single LLM in the main process, we now spawn dp subprocesses, each responsible for creating its own LLM instance, and set up communication between the main process and each subprocess.

While this approach may seem a bit more complex, it's necessary because vLLM depends heavily on environment variables and doesn't handle running multiple LLM instances within the same process well.
Spawning separate subprocesses is the only reliable way to isolate and manage multiple LLM instances.
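
To make the pattern concrete, here is a minimal sketch of the idea (an illustration only, assuming a simple contiguous GPU assignment and a toy message format; the actual llm_worker in this PR differs in its details):

# Minimal sketch (illustrative, not the exact PR code): one subprocess per DP
# rank, each creating its own LLM and serving messages over a Pipe.
import multiprocessing as mp
import os


def llm_worker(model_name, data_parallel_rank, tensor_parallel_size, connection):
    # Pin this worker to its own slice of GPUs before vLLM reads the
    # environment, so each subprocess gets a fully isolated LLM instance.
    first = data_parallel_rank * tensor_parallel_size
    gpu_ids = range(first, first + tensor_parallel_size)
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)

    from vllm import LLM  # imported after the env vars are set

    llm = LLM(model=model_name, tensor_parallel_size=tensor_parallel_size)

    # Serve requests from the main process until asked to shut down.
    while True:
        message = connection.recv()
        if message["type"] == "shutdown":
            break
        if message["type"] == "call":
            result = getattr(llm, message["method"])(**message.get("kwargs", {}))
            connection.send(result)


if __name__ == "__main__":
    mp.set_start_method("spawn")  # keep CUDA state isolated per subprocess
    dp_size, tp_size = 2, 2
    connections, workers = [], []
    for rank in range(dp_size):
        parent_conn, child_conn = mp.Pipe()
        worker = mp.Process(
            target=llm_worker,
            args=("Qwen/Qwen2.5-1.5B", rank, tp_size, child_conn),
        )
        worker.start()
        connections.append(parent_conn)
        workers.append(worker)

    # Example round trip: ask each worker for a quick generation, then shut down.
    from vllm import SamplingParams

    params = SamplingParams(max_tokens=16)
    for conn in connections:
        conn.send({"type": "call", "method": "generate",
                   "kwargs": {"prompts": ["Hello, AI!"], "sampling_params": params}})
    print([conn.recv() for conn in connections])
    for conn in connections:
        conn.send({"type": "shutdown"})
    for worker in workers:
        worker.join()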

Comment on lines 408 to 413
for connection, prompts in zip(connections, chunked_prompts):
kwargs = {"prompts": prompts, "sampling_params": sampling_params}
connection.send({"type": "call", "method": "generate", "kwargs": kwargs})

# Wait for and collect all results
all_outputs = [connection.recv() for connection in connections]
qgallouedec (Member Author) commented Apr 17, 2025:

We can't call .generate directly anymore, since the LLM instances live in subprocesses. Instead, we send a communication instruction and wait for the results.
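
For context, here is a sketch of how the main process might drive these connections end to end (the contiguous chunking and the final flattening are assumptions for illustration, not necessarily how the PR splits the prompts):

# Illustrative sketch of the driver side (not the PR code).
def generate_across_dp(connections, prompts, sampling_params):
    # Split the prompts into one contiguous chunk per DP worker.
    n = len(connections)
    chunk_size = (len(prompts) + n - 1) // n
    chunked_prompts = [prompts[i * chunk_size:(i + 1) * chunk_size] for i in range(n)]

    # Send one "call" instruction per worker...
    for connection, chunk in zip(connections, chunked_prompts):
        kwargs = {"prompts": chunk, "sampling_params": sampling_params}
        connection.send({"type": "call", "method": "generate", "kwargs": kwargs})

    # ...then read the replies back in the same connection order, so the
    # flattened result lines up with the original prompt order.
    all_outputs = [connection.recv() for connection in connections]
    return [output for outputs in all_outputs for output in outputs]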

A collaborator commented:

So the outputs won't get mixed up here, since you go over each connection in order?

qgallouedec (Member Author) replied:

I think so. I'll check manually though, as this could lead to a silent bug or unwanted behavior if the order got mixed up.
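
One way such a manual check could look, with trivial echo workers standing in for the LLMs (just an illustration, not the PR's actual test): since recv() blocks per connection and the connections are iterated in the order they were created, the outputs come back aligned with the chunks even if a later worker finishes first.

# Toy ordering check (illustrative, not the PR's test).
import multiprocessing as mp


def echo_worker(rank, connection):
    # Echo the received prompts back, tagged with this worker's rank.
    message = connection.recv()
    connection.send([f"rank{rank}:{p}" for p in message["kwargs"]["prompts"]])


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    chunked_prompts = [["p0", "p1"], ["p2", "p3"]]
    connections = []
    for rank, chunk in enumerate(chunked_prompts):
        parent_conn, child_conn = mp.Pipe()
        mp.Process(target=echo_worker, args=(rank, child_conn)).start()
        parent_conn.send({"type": "call", "method": "generate",
                          "kwargs": {"prompts": chunk}})
        connections.append(parent_conn)

    # Iterating the connections in creation order keeps outputs aligned
    # with chunked_prompts, regardless of which worker finished first.
    all_outputs = [connection.recv() for connection in connections]
    assert all_outputs == [["rank0:p0", "rank0:p1"], ["rank1:p2", "rank1:p3"]]
    print("order preserved:", all_outputs)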

A member commented:

If the output gets mixed up, then we can use
from collections import OrderedDict
to preserve the prompt-output order!

shirinyamani (Member) commented Apr 18, 2025:

@qgallouedec Did you check this by any chance? Are we getting the prompt-responses aligned with the connections?

qgallouedec (Member Author) replied:

Not yet

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lewtun (Member) left a comment:

Very clean implementation! LGTM with some nits and a question about whether we can support CUDA graphs when DP=1 and TP>1

zzzzzec commented Apr 17, 2025:

Hello, thank you very much for TRL's support for vLLM DP; it's exactly what I've been looking forward to and needing, and it has greatly accelerated my experiments.

However, I encountered an issue when running vllm_serve.
My hardware configuration is 8 × H20 (96 GB).
I used the command:

NCCL_DEBUG=WARN python -m trl.cli vllm-serve \
    --model /mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct \
    --tensor_parallel_size 1 \
    --data_parallel_size 8 \
    --host 0.0.0.0 \
    --port 6004 

This resulted in an error:

ctmt241129162844w28-74bfc659cd-zpz6r:2095608:2095608 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095691:2095691 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 5 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095613:2095613 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095681:2095681 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 7 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095494:2095494 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 3 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095499:2095499 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 6 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095686:2095686 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 4 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095532:2095532 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 8000

After searching for the cause for a long time, I finally discovered that deleting all TRL-related code from the vllm_serve.py file allows it to work normally, specifically:

# from trl import TrlParser
# from trl.import_utils import (
#     is_fastapi_available,
#     is_pydantic_available,
#     is_uvicorn_available,
#     is_vllm_available,
# )


# if is_fastapi_available():
#     from fastapi import FastAPI


# if is_pydantic_available():
#     from pydantic import BaseModel


# if is_uvicorn_available():
#     import uvicorn


# if is_vllm_available():
#     from vllm import LLM, SamplingParams
#     from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
#     from vllm.distributed.parallel_state import get_world_group
#     from vllm.distributed.utils import StatelessProcessGroup
#     from vllm.sampling_params import GuidedDecodingParams
#     from vllm.utils import get_open_port

# copy the class TrlParser(HfArgumentParser) definition here
...

and run

NCCL_DEBUG=WARN python vllm_serve.py \
    --model /mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct \
    --tensor_parallel_size 1 \
    --data_parallel_size 8 \
    --host 0.0.0.0 \
    --port 6004 

This works correctly. I wonder if the problem is that importing TRL causes certain processes to touch the cuda:0 device, which then leads to this error?
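
One quick way to sanity-check this hypothesis (an illustrative snippet, not something from this PR) is to see whether importing trl initializes CUDA in the parent process before the workers are spawned:

# Does importing trl (and its transitive imports) initialize CUDA in the
# parent process? (Illustrative check only.)
import torch

print("before import:", torch.cuda.is_initialized())
import trl  # noqa: F401
print("after import:", torch.cuda.is_initialized())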

Could you please help look into this error? Thank you for your assistance.

The complete error log:

[2025-04-17 23:03:28,056] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 04-17 23:03:33 [__init__.py:239] Automatically detected platform cuda.
INFO:     Started server process [2093690]
INFO:     Waiting for application startup.
INFO 04-17 23:04:10 [config.py:600] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 04-17 23:04:10 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-17 23:04:10 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-17 23:04:11 [config.py:600] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 04-17 23:04:11 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-17 23:04:11 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-17 23:04:11 [config.py:600] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 04-17 23:04:11 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-17 23:04:11 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-17 23:04:12 [config.py:600] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 04-17 23:04:12 [config.py:600] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 04-17 23:04:12 [config.py:600] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 04-17 23:04:12 [config.py:600] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 04-17 23:04:12 [config.py:600] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 04-17 23:04:12 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-17 23:04:12 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-17 23:04:12 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 04-17 23:04:12 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-17 23:04:12 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 04-17 23:04:12 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-17 23:04:12 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-17 23:04:12 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-17 23:04:12 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-17 23:04:12 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
[2025-04-17 23:04:40,964] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 23:04:40,964] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 23:04:40,964] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 23:04:43,643] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 23:04:43,769] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 23:04:43,791] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 23:04:43,812] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-04-17 23:04:43,816] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 04-17 23:04:54 [__init__.py:239] Automatically detected platform cuda.
INFO 04-17 23:04:54 [__init__.py:239] Automatically detected platform cuda.
INFO 04-17 23:04:54 [__init__.py:239] Automatically detected platform cuda.
INFO 04-17 23:04:55 [__init__.py:239] Automatically detected platform cuda.
INFO 04-17 23:04:56 [__init__.py:239] Automatically detected platform cuda.
INFO 04-17 23:04:56 [__init__.py:239] Automatically detected platform cuda.
INFO 04-17 23:04:56 [__init__.py:239] Automatically detected platform cuda.
INFO 04-17 23:04:56 [__init__.py:239] Automatically detected platform cuda.
(EngineCore_0 pid=2095608) INFO 04-17 23:05:00 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=102400, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(EngineCore_4 pid=2095686) INFO 04-17 23:05:00 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=102400, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(EngineCore_1 pid=2095613) INFO 04-17 23:05:00 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=102400, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(EngineCore_7 pid=2095681) INFO 04-17 23:05:00 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=102400, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(EngineCore_2 pid=2095532) INFO 04-17 23:05:00 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=102400, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(EngineCore_5 pid=2095691) INFO 04-17 23:05:00 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=102400, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(EngineCore_3 pid=2095494) INFO 04-17 23:05:00 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=102400, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(EngineCore_6 pid=2095499) INFO 04-17 23:05:00 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=102400, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/tenant-home_speed/Model/Qwen/Qwen2.5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
(EngineCore_4 pid=2095686) INFO 04-17 23:05:02 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_3 pid=2095494) INFO 04-17 23:05:02 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_7 pid=2095681) INFO 04-17 23:05:02 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_1 pid=2095613) INFO 04-17 23:05:02 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_5 pid=2095691) INFO 04-17 23:05:02 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_6 pid=2095499) INFO 04-17 23:05:02 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_0 pid=2095608) INFO 04-17 23:05:02 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_2 pid=2095532) INFO 04-17 23:05:02 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_4 pid=2095686) WARNING 04-17 23:05:02 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f004f0af4f0>
(EngineCore_1 pid=2095613) WARNING 04-17 23:05:02 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fbb174a7730>
(EngineCore_5 pid=2095691) WARNING 04-17 23:05:02 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f31e4137730>
(EngineCore_0 pid=2095608) WARNING 04-17 23:05:02 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7feca8acb250>
(EngineCore_6 pid=2095499) WARNING 04-17 23:05:02 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f5ef1ab7250>
(EngineCore_7 pid=2095681) WARNING 04-17 23:05:02 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f62fa3bb400>
(EngineCore_3 pid=2095494) WARNING 04-17 23:05:02 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f65aecef730>
(EngineCore_2 pid=2095532) WARNING 04-17 23:05:02 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f9f1dc4b4f0>
(EngineCore_5 pid=2095691) INFO 04-17 23:05:03 [parallel_state.py:836] Adjusting world_size=8 rank=5 distributed_init_method=tcp://127.0.0.1:37608 for DP
(EngineCore_0 pid=2095608) INFO 04-17 23:05:03 [parallel_state.py:836] Adjusting world_size=8 rank=0 distributed_init_method=tcp://127.0.0.1:37608 for DP
(EngineCore_1 pid=2095613) INFO 04-17 23:05:03 [parallel_state.py:836] Adjusting world_size=8 rank=1 distributed_init_method=tcp://127.0.0.1:37608 for DP
(EngineCore_2 pid=2095532) INFO 04-17 23:05:03 [parallel_state.py:836] Adjusting world_size=8 rank=2 distributed_init_method=tcp://127.0.0.1:37608 for DP
(EngineCore_7 pid=2095681) INFO 04-17 23:05:03 [parallel_state.py:836] Adjusting world_size=8 rank=7 distributed_init_method=tcp://127.0.0.1:37608 for DP
(EngineCore_6 pid=2095499) INFO 04-17 23:05:03 [parallel_state.py:836] Adjusting world_size=8 rank=6 distributed_init_method=tcp://127.0.0.1:37608 for DP
(EngineCore_4 pid=2095686) INFO 04-17 23:05:03 [parallel_state.py:836] Adjusting world_size=8 rank=4 distributed_init_method=tcp://127.0.0.1:37608 for DP
(EngineCore_3 pid=2095494) INFO 04-17 23:05:03 [parallel_state.py:836] Adjusting world_size=8 rank=3 distributed_init_method=tcp://127.0.0.1:37608 for DP
(EngineCore_1 pid=2095613) INFO 04-17 23:05:04 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_2 pid=2095532) INFO 04-17 23:05:04 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_1 pid=2095613) INFO 04-17 23:05:04 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_2 pid=2095532) INFO 04-17 23:05:04 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_3 pid=2095494) INFO 04-17 23:05:04 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_5 pid=2095691) INFO 04-17 23:05:04 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_7 pid=2095681) INFO 04-17 23:05:04 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_6 pid=2095499) INFO 04-17 23:05:04 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_0 pid=2095608) INFO 04-17 23:05:04 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_4 pid=2095686) INFO 04-17 23:05:04 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_3 pid=2095494) INFO 04-17 23:05:04 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_5 pid=2095691) INFO 04-17 23:05:04 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_7 pid=2095681) INFO 04-17 23:05:04 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_6 pid=2095499) INFO 04-17 23:05:04 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_0 pid=2095608) INFO 04-17 23:05:04 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_4 pid=2095686) INFO 04-17 23:05:04 [pynccl.py:69] vLLM is using nccl==2.21.5
NCCL version 2.21.5+cuda12.4

ctmt241129162844w28-74bfc659cd-zpz6r:2095608:2095608 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095691:2095691 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 5 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095613:2095613 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095681:2095681 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 7 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095494:2095494 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 3 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095499:2095499 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 6 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095686:2095686 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 4 and rank 0 both on CUDA device 8000
ctmt241129162844w28-74bfc659cd-zpz6r:2095532:2095532 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 8000
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 564, in __init__
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     self.model_executor = executor_class(vllm_config)
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     self._init_executor()
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     self.collective_rpc("init_device")
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 564, in __init__
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     return func(*args, **kwargs)
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     self.model_executor = executor_class(vllm_config)
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     self.worker.init_device()  # type: ignore
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     self._init_executor()
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     self.collective_rpc("init_device")
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     return func(*args, **kwargs)
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     _DP = init_model_parallel_group(group_ranks,
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     self.worker.init_device()  # type: ignore
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     return GroupCoordinator(
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 209, in __init__
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     self.device_communicator = device_comm_cls(
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     self.pynccl_comm = PyNcclCommunicator(
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 101, in __init__
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 279, in ncclCommInitRank
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     _DP = init_model_parallel_group(group_ranks,
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 258, in NCCL_CHECK
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390]     raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     return GroupCoordinator(
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 209, in __init__
(EngineCore_7 pid=2095681) ERROR 04-17 23:05:06 [core.py:390] 
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     self.device_communicator = device_comm_cls(
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     self.pynccl_comm = PyNcclCommunicator(
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 101, in __init__
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 279, in ncclCommInitRank
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 258, in NCCL_CHECK
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390]     raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_0 pid=2095608) ERROR 04-17 23:05:06 [core.py:390] 
CRITICAL 04-17 23:05:06 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-17 23:05:06 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 564, in __init__
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     self.model_executor = executor_class(vllm_config)
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     self._init_executor()
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     self.collective_rpc("init_device")
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     return func(*args, **kwargs)
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     self.worker.init_device()  # type: ignore
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     _DP = init_model_parallel_group(group_ranks,
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     return GroupCoordinator(
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 209, in __init__
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     self.device_communicator = device_comm_cls(
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     self.pynccl_comm = PyNcclCommunicator(
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 101, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 279, in ncclCommInitRank
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 564, in __init__
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 258, in NCCL_CHECK
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390]     raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_4 pid=2095686) ERROR 04-17 23:05:06 [core.py:390] 
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     self.model_executor = executor_class(vllm_config)
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 564, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     self._init_executor()
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     self.collective_rpc("init_device")
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     self.model_executor = executor_class(vllm_config)
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     return func(*args, **kwargs)
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     self._init_executor()
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     self.worker.init_device()  # type: ignore
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     self.collective_rpc("init_device")
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     return func(*args, **kwargs)
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     self.worker.init_device()  # type: ignore
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     _DP = init_model_parallel_group(group_ranks,
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     return GroupCoordinator(
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 209, in __init__
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     self.device_communicator = device_comm_cls(
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 564, in __init__
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 564, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     self.pynccl_comm = PyNcclCommunicator(
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 101, in __init__
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     _DP = init_model_parallel_group(group_ranks,
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 279, in ncclCommInitRank
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     return GroupCoordinator(
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     self.model_executor = executor_class(vllm_config)
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     self.model_executor = executor_class(vllm_config)
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 258, in NCCL_CHECK
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 209, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390]     raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     self.device_communicator = device_comm_cls(
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     self._init_executor()
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     self._init_executor()
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_6 pid=2095499) ERROR 04-17 23:05:06 [core.py:390] 
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     self.pynccl_comm = PyNcclCommunicator(
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     self.collective_rpc("init_device")
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 101, in __init__
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     self.collective_rpc("init_device")
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 564, in __init__
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 279, in ncclCommInitRank
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     return func(*args, **kwargs)
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     return func(*args, **kwargs)
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 258, in NCCL_CHECK
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     self.worker.init_device()  # type: ignore
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390]     raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     self.worker.init_device()  # type: ignore
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
CRITICAL 04-17 23:05:06 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     self.model_executor = executor_class(vllm_config)
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_2 pid=2095532) ERROR 04-17 23:05:06 [core.py:390] 
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     self._init_executor()
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     self.collective_rpc("init_device")
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     _DP = init_model_parallel_group(group_ranks,
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     _DP = init_model_parallel_group(group_ranks,
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     return GroupCoordinator(
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     return func(*args, **kwargs)
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 209, in __init__
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     return GroupCoordinator(
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     self.device_communicator = device_comm_cls(
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 209, in __init__
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     self.worker.init_device()  # type: ignore
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     self.device_communicator = device_comm_cls(
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     self.pynccl_comm = PyNcclCommunicator(
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     self.pynccl_comm = PyNcclCommunicator(
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 101, in __init__
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 101, in __init__
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 279, in ncclCommInitRank
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 279, in ncclCommInitRank
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 258, in NCCL_CHECK
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 258, in NCCL_CHECK
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390]     raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390]     raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     _DP = init_model_parallel_group(group_ranks,
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_5 pid=2095691) ERROR 04-17 23:05:06 [core.py:390] 
(EngineCore_1 pid=2095613) ERROR 04-17 23:05:06 [core.py:390] 
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     return GroupCoordinator(
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 209, in __init__
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     self.device_communicator = device_comm_cls(
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     self.pynccl_comm = PyNcclCommunicator(
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 101, in __init__
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 279, in ncclCommInitRank
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]   File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 258, in NCCL_CHECK
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390]     raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_3 pid=2095494) ERROR 04-17 23:05:06 [core.py:390] 
CRITICAL 04-17 23:05:06 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
[the CRITICAL line above is emitted once per failed worker process]
ERROR:    Traceback (most recent call last):
  File "/mnt/tenant-home_speed/shard/zhangenci/.venv/trl/lib/python3.10/site-packages/starlette/routing.py", line 692, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/mnt/tenant-home_speed/shard/zhangenci/research/trl/scripts/vllm_serve.py", line 309, in lifespan
    msg = connection.recv()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

ERROR:    Application startup failed. Exiting.

qgallouedec and others added 5 commits April 17, 2025 09:55
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
@shirinyamani shirinyamani self-requested a review April 24, 2025 15:14
@qgallouedec qgallouedec changed the title 🗂️ Data Parallel for vLLM server 🗂️ Up to 4x speed-up: Data Parallel for vLLM server Apr 24, 2025
@qgallouedec qgallouedec changed the title 🗂️ Up to 4x speed-up: Data Parallel for vLLM server ⚡ Up to 4x speed-up: Data Parallel for vLLM server Apr 24, 2025
@qgallouedec
Copy link
Member Author

Training a 7B model with GRPO on two nodes (one for vLLM, one for training). Generation is way faster than before!

[W&B chart: 24_04_2025, 14_45_29]

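For reference, the training node in such a two-node run only needs to point GRPO at the remote vLLM server. The snippet below is a minimal, hypothetical sketch (not the exact setup behind the chart above): it assumes the GRPOConfig options for server-based vLLM generation (use_vllm, vllm_server_host, vllm_server_port), and the server address and the toy reward function are placeholders.

# demo_train_node.py -- hypothetical sketch, run on the training node
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward just to keep the sketch self-contained: prefer ~50-character completions
    return [-abs(50 - len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO",
    use_vllm=True,                # generate with the external vLLM server instead of locally
    vllm_server_host="10.0.0.2",  # placeholder: address of the node running trl vllm-serve
    vllm_server_port=8000,        # default port of the vLLM server
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

The vLLM node runs the server command from the PR description; generation requests and weight updates then go over the host/port configured above.
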
@qgallouedec
Copy link
Member Author

[W&B chart: 24_04_2025, 14_58_56]

@qgallouedec qgallouedec changed the title ⚡ Up to 4x speed-up: Data Parallel for vLLM server ⚡ Up to 4x faster: Data Parallel for vLLM server Apr 24, 2025
@qgallouedec qgallouedec merged commit 36685c8 into main Apr 24, 2025
9 of 10 checks passed
@qgallouedec qgallouedec deleted the vllm-serve-dp branch April 24, 2025 22:14
@qgallouedec qgallouedec mentioned this pull request Apr 24, 2025
5 tasks
@ahatamiz
Copy link
Contributor

Hi @qgallouedec @zzzzzec

Unfortunately, I don't believe this feature works properly! I am not able to run anything with DP>1, as I get this weird error (log from 2 nodes with DeepSeek-R1-Distill-Qwen-7B, using trl==0.17.0 and vllm==0.8.3):

(EngineCore_1 pid=2353892) INFO 04-25 16:14:42 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_0 pid=2353893) INFO 04-25 16:14:42 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_6 pid=2353882) INFO 04-25 16:14:42 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_2 pid=2353883) INFO 04-25 16:14:42 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_3 pid=2353898) WARNING 04-25 16:14:42 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x155408633710>
(EngineCore_1 pid=2353892) WARNING 04-25 16:14:42 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x15530242e790>
(EngineCore_0 pid=2353893) WARNING 04-25 16:14:42 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x155302dcf8d0>
(EngineCore_7 pid=2353909) INFO 04-25 16:14:42 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_6 pid=2353882) WARNING 04-25 16:14:42 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x15540862f0d0>
(EngineCore_2 pid=2353883) WARNING 04-25 16:14:42 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x155302442110>
(EngineCore_4 pid=2353914) INFO 04-25 16:14:42 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_7 pid=2353909) WARNING 04-25 16:14:42 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x155554fdde90>
(EngineCore_4 pid=2353914) WARNING 04-25 16:14:42 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x15530242e350>
(EngineCore_5 pid=2353903) INFO 04-25 16:14:42 [worker_base.py:589] Injected <class 'trl.scripts.vllm_serve.WeightSyncWorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['close_communicator', 'init_communicator', 'update_named_param']
(EngineCore_5 pid=2353903) WARNING 04-25 16:14:42 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x155302431e90>
(EngineCore_6 pid=2353882) INFO 04-25 16:14:43 [parallel_state.py:836] Adjusting world_size=8 rank=6 distributed_init_method=tcp://127.0.0.1:12730 for DP
(EngineCore_1 pid=2353892) INFO 04-25 16:14:43 [parallel_state.py:836] Adjusting world_size=8 rank=1 distributed_init_method=tcp://127.0.0.1:12730 for DP
(EngineCore_0 pid=2353893) INFO 04-25 16:14:43 [parallel_state.py:836] Adjusting world_size=8 rank=0 distributed_init_method=tcp://127.0.0.1:12730 for DP
(EngineCore_4 pid=2353914) INFO 04-25 16:14:43 [parallel_state.py:836] Adjusting world_size=8 rank=4 distributed_init_method=tcp://127.0.0.1:12730 for DP
(EngineCore_2 pid=2353883) INFO 04-25 16:14:43 [parallel_state.py:836] Adjusting world_size=8 rank=2 distributed_init_method=tcp://127.0.0.1:12730 for DP
(EngineCore_5 pid=2353903) INFO 04-25 16:14:43 [parallel_state.py:836] Adjusting world_size=8 rank=5 distributed_init_method=tcp://127.0.0.1:12730 for DP
(EngineCore_7 pid=2353909) INFO 04-25 16:14:43 [parallel_state.py:836] Adjusting world_size=8 rank=7 distributed_init_method=tcp://127.0.0.1:12730 for DP
(EngineCore_3 pid=2353898) INFO 04-25 16:14:43 [parallel_state.py:836] Adjusting world_size=8 rank=3 distributed_init_method=tcp://127.0.0.1:12730 for DP
(EngineCore_6 pid=2353882) INFO 04-25 16:14:43 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_6 pid=2353882) INFO 04-25 16:14:43 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_0 pid=2353893) INFO 04-25 16:14:43 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_2 pid=2353883) INFO 04-25 16:14:43 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_0 pid=2353893) INFO 04-25 16:14:43 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_7 pid=2353909) INFO 04-25 16:14:43 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_3 pid=2353898) INFO 04-25 16:14:43 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_1 pid=2353892) INFO 04-25 16:14:43 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_4 pid=2353914) INFO 04-25 16:14:43 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_2 pid=2353883) INFO 04-25 16:14:43 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_5 pid=2353903) INFO 04-25 16:14:43 [utils.py:990] Found nccl from library libnccl.so.2
(EngineCore_7 pid=2353909) INFO 04-25 16:14:43 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_3 pid=2353898) INFO 04-25 16:14:43 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_4 pid=2353914) INFO 04-25 16:14:43 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_1 pid=2353892) INFO 04-25 16:14:43 [pynccl.py:69] vLLM is using nccl==2.21.5
(EngineCore_5 pid=2353903) INFO 04-25 16:14:43 [pynccl.py:69] vLLM is using nccl==2.21.5
NCCL version 2.21.5+cuda12.4
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 559, in __init__
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] self.model_executor = executor_class(vllm_config)
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in __init__
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] self._init_executor()
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] self.collective_rpc("init_device")
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] return func(*args, **kwargs)
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] self.worker.init_device() # type: ignore
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] _DP = init_model_parallel_group(group_ranks,
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] return GroupCoordinator(
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 209, in __init__
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] self.device_communicator = device_comm_cls(
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in __init__
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] self.pynccl_comm = PyNcclCommunicator(
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_2 pid=2353883) ERROR 04-25 16:14:44 [core.py:390]
(EngineCore_4 pid=2353914) ERROR 04-25 16:14:44 [core.py:390] EngineCore hit an exception: [identical traceback and NCCL "invalid usage" error as EngineCore_2 above]
(EngineCore_1 pid=2353892) ERROR 04-25 16:14:44 [core.py:390] EngineCore hit an exception: [identical traceback and NCCL "invalid usage" error as EngineCore_2 above]
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 559, in __init__
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] super().__init__(input_path, output_path, vllm_config, executor_class,
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in __init__
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] super().__init__(vllm_config, executor_class, log_stats)
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 67, in __init__
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] self.model_executor = executor_class(vllm_config)
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in init
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] self._init_executor()
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] self.collective_rpc("init_device")
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] return func(*args, **kwargs)
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] self.worker.init_device() # type: ignore
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] _DP = init_model_parallel_group(group_ranks,
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] return GroupCoordinator(
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 209, in init
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] self.device_communicator = device_comm_cls(
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in init
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] self.pynccl_comm = PyNcclCommunicator(
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in init
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_6 pid=2353882) ERROR 04-25 16:14:44 [core.py:390]
CRITICAL 04-25 16:14:44 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 559, in init
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] super().init(input_path, output_path, vllm_config, executor_class,
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in init
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] super().init(vllm_config, executor_class, log_stats)
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 67, in init
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] self.model_executor = executor_class(vllm_config)
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in init
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] self._init_executor()
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] self.collective_rpc("init_device")
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] return func(*args, **kwargs)
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] self.worker.init_device() # type: ignore
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] _DP = init_model_parallel_group(group_ranks,
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] return GroupCoordinator(
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 209, in init
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] self.device_communicator = device_comm_cls(
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in init
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] self.pynccl_comm = PyNcclCommunicator(
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in init
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_5 pid=2353903) ERROR 04-25 16:14:44 [core.py:390]
CRITICAL 04-25 16:14:44 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-25 16:14:44 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-25 16:14:44 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 559, in init
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] super().init(input_path, output_path, vllm_config, executor_class,
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in init
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] super().init(vllm_config, executor_class, log_stats)
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 67, in init
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] self.model_executor = executor_class(vllm_config)
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in init
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] self._init_executor()
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] self.collective_rpc("init_device")
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] return func(*args, **kwargs)
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] self.worker.init_device() # type: ignore
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] _DP = init_model_parallel_group(group_ranks,
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] return GroupCoordinator(
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 209, in init
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] self.device_communicator = device_comm_cls(
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in init
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] self.pynccl_comm = PyNcclCommunicator(
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in init
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_7 pid=2353909) ERROR 04-25 16:14:44 [core.py:390]
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 559, in init
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] super().init(input_path, output_path, vllm_config, executor_class,
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in init
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] super().init(vllm_config, executor_class, log_stats)
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 67, in init
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] self.model_executor = executor_class(vllm_config)
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in init
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] self._init_executor()
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] self.collective_rpc("init_device")
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] return func(*args, **kwargs)
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] self.worker.init_device() # type: ignore
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] _DP = init_model_parallel_group(group_ranks,
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] return GroupCoordinator(
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 209, in init
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] self.device_communicator = device_comm_cls(
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in init
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] self.pynccl_comm = PyNcclCommunicator(
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in init
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_3 pid=2353898) ERROR 04-25 16:14:44 [core.py:390]
CRITICAL 04-25 16:14:44 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 376, in run_engine_core
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 559, in init
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] super().init(input_path, output_path, vllm_config, executor_class,
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 319, in init
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] super().init(vllm_config, executor_class, log_stats)
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 67, in init
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in init
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] self._init_executor()
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] self.collective_rpc("init_device")
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] return func(*args, **kwargs)
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 604, in init_device
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] self.worker.init_device() # type: ignore
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 113, in init_device
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] init_worker_distributed_environment(self.parallel_config, self.rank,
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 299, in init_worker_distributed_environment
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 995, in ensure_model_parallel_initialized
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] initialize_model_parallel(tensor_model_parallel_size,
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 952, in initialize_model_parallel
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] _DP = init_model_parallel_group(group_ranks,
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 733, in init_model_parallel_group
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] return GroupCoordinator(
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 209, in init
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] self.device_communicator = device_comm_cls(
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 39, in init
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] self.pynccl_comm = PyNcclCommunicator(
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in init
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] File "/opt/conda/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] raise RuntimeError(f"NCCL error: {error_str}")
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390] RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
(EngineCore_0 pid=2353893) ERROR 04-25 16:14:44 [core.py:390]
CRITICAL 04-25 16:14:44 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-25 16:14:44 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-25 16:14:44 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.

batch-block7-00733:2353893:2353893 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device f000

batch-block7-00733:2353892:2353892 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device f000

batch-block7-00733:2353883:2353883 [0] init.cc:943 NCCL WARN Duplicate GPU detected : rank 2 and rank 0 both on CUDA device f000
ERROR: Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 692, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/trl/scripts/vllm_serve.py", line 362, in lifespan
msg = connection.recv()
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 399, in _recv
raise EOFError
EOFError

ERROR: Application startup failed. Exiting.
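
The `Duplicate GPU detected` warnings above suggest that several data-parallel engine cores ended up on the same CUDA device, which is why NCCL rejected the communicator setup. As a hedged illustration (not the actual trl implementation), one way to give each data-parallel worker its own GPU slice before vLLM initializes looks roughly like this:

# sketch_dp_device_split.py (illustration only)
import os
from multiprocessing import Pipe, Process

def llm_worker_stub(dp_rank: int, tp_size: int, conn) -> None:
    # Hypothetical assignment: worker k owns GPUs [k * tp_size, (k + 1) * tp_size).
    gpus = range(dp_rank * tp_size, (dp_rank + 1) * tp_size)
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    # ... the vLLM engine would be created here, then serve requests from `conn` ...
    conn.send({"type": "ready", "rank": dp_rank})

if __name__ == "__main__":
    dp_size, tp_size = 2, 1  # example values
    connections = []
    for rank in range(dp_size):
        parent_conn, child_conn = Pipe()
        Process(target=llm_worker_stub, args=(rank, tp_size, child_conn)).start()
        connections.append(parent_conn)
    # The main process blocks on its workers, like the `connection.recv()` in the trace above.
    print([conn.recv() for conn in connections])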

@qgallouedec
Copy link
Member Author

What command do you use?

@ahatamiz
Copy link
Contributor

ahatamiz commented Apr 25, 2025

To run the vLLM part, I use this:

srun \
  --nodes=1 \
  --ntasks=1 \
  --nodelist="$VLLM_NODE" \
  --container-image="$IMAGE" \
  --container-env=ALL \
  --container-mounts="/home/${USER}:/home/${USER}" \
  --container-workdir="$OUTPUT_ROOT" \
  --output="${LOGS_DIR}/vllm_%x_${DATETIME}.log" \
bash -c "
  echo \"[vLLM Node] Starting vLLM on \$(hostname -s)\"
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \\
    trl vllm-serve \\
      --model \"$MODEL\" \\
      --tensor_parallel_size \"$TP\" \\
      --data_parallel_size \"$DP\" \\
      --host \"$VLLM_NODE\"
"

The above works fine if DP=1 and TP is set properly.

@qgallouedec
Copy link
Member Author

It could be related to this:

trl/trl/cli.py

Lines 100 to 106 in 29c5e05

if script_args.tensor_parallel_size == 1 and script_args.data_parallel_size > 1:
    warnings.warn(
        "Detected configuration: tensor_parallel_size=1 and data_parallel_size>1. This setup is known to "
        "cause a crash when using the `trl vllm-serve` CLI entry point. As a workaround, please run the "
        "server using the module path instead: `python -m trl.scripts.vllm_serve`",
        RuntimeWarning,
    )

Try replacing `trl vllm-serve` with `python -m trl.scripts.vllm_serve`.

@ahatamiz
Copy link
Contributor

Thanks @qgallouedec! Just tested it with TP=1 and DP=8 and it works!

@ahatamiz
Copy link
Contributor

@qgallouedec Unfortunately, the issue seems to persist despite appearing to be resolved at first. This time, we use TP=1 and DP=8. I'd appreciate any insights here:

Processed prompts: 100%|██████████| 16/16 [01:04<00:00,  4.01s/it, est. speed input: 1058.16 toks/s, output: 474.48 toks/s]
Processed prompts: 100%|██████████| 16/16 [01:04<00:00,  4.01s/it, est. speed input: 1004.50 toks/s, output: 486.24 toks/s]
Processed prompts: 100%|██████████| 16/16 [01:04<00:00,  4.00s/it, est. speed input: 1084.13 toks/s, output: 504.99 toks/s]
Processed prompts: 100%|██████████| 16/16 [01:04<00:00,  4.01s/it, est. speed input: 1069.00 toks/s, output: 509.25 toks/s]
Processed prompts: 100%|██████████| 16/16 [01:04<00:00,  4.01s/it, est. speed input: 1051.96 toks/s, output: 494.82 toks/s]
Processed prompts: 100%|██████████| 16/16 [01:04<00:00,  4.01s/it, est. speed input: 1044.32 toks/s, output: 484.14 toks/s]
Processed prompts: 100%|██████████| 16/16 [01:04<00:00,  4.00s/it, est. speed input: 1068.81 toks/s, output: 499.86 toks/s]
Processed prompts: 100%|██████████| 16/16 [01:04<00:00,  4.01s/it, est. speed input: 1060.76 toks/s, output: 511.16 toks/s]
INFO:     10.49.161.223:13970 - "POST /generate/ HTTP/1.1" 200 OK
INFO:     10.49.161.223:39872 - "POST /update_named_param/ HTTP/1.1" 200 OK
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459] Invocation of collective_rpc method failed
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459] Traceback (most recent call last):
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]   File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 456, in _handle_client_request
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]     output.result = method(
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]                     ^^^^^^^
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]   File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 303, in collective_rpc
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]     return self.model_executor.collective_rpc(method, timeout, args,
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]   File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]   File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2347, in run_method
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]     return func(*args, **kwargs)
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]   File "/opt/conda/lib/python3.11/site-packages/trl/scripts/vllm_serve.py", line 126, in update_named_param
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459]     raise RuntimeError("Communicator not initialized. Call `init_communicator` first.")
(EngineCore_7 pid=3107394) ERROR 04-25 17:45:42 [core.py:459] RuntimeError: Communicator not initialized. Call `init_communicator` first.

It seems like we do indeed finish several completions before running into this weird error!

@qgallouedec
Copy link
Member Author

Are you using a modified version of GRPO?

@ahatamiz
Copy link
Contributor

Yes! But it works without any issues with the previous version, which is basically TP=1 and DP=1.

@qgallouedec
Copy link
Member Author

Try adding this line:

  self.vllm_client = VLLMClient(args.vllm_server_host, args.vllm_server_port, connection_timeout=args.vllm_server_timeout)
+ self.vllm_client.init_communicator()
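
For reference, a minimal sketch of the resulting call order on the client side (the host, port, and timeout values are placeholders, and the weight-update loop at the end is illustrative rather than the exact trainer code):

# sketch_client_order.py (illustration only)
from trl.extras.vllm_client import VLLMClient

# 1. Connect to the running server (example host/port/timeout).
client = VLLMClient("0.0.0.0", 8000, connection_timeout=240)

# 2. Set up the weight-sync communicator *before* any weight update;
#    otherwise the server raises
#    "Communicator not initialized. Call `init_communicator` first."
client.init_communicator()

# 3. Only then push updated weights, e.g. (signature illustrative):
# for name, param in model.named_parameters():
#     client.update_named_param(name, param.data)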

@ahatamiz
Copy link
Contributor

Thanks @qgallouedec! I kicked off another training run with TP=1 and DP=8 and have not noticed any issues, at least so far.

Hopefully the issue is resolved.

Thanks again for your amazing work!
