Non-colocated generation fails with vLLM tp>1 #595

@yuki-97

Error (same behavior as #564):

raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

Repro:
commit 3f6d52f

uv run python examples/run_grpo_math.py \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.gpus_per_node=2 \
    policy.generation.vllm_cfg.tensor_parallel_size=2 \
    checkpointing.enabled=false \
    cluster.gpus_per_node=4
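
The error message itself suggests rerunning with `NCCL_DEBUG=INFO`. A minimal sketch of how that could be set up before rerunning the repro, assuming the variables are inherited by the worker processes that initialize NCCL (this may need extra configuration depending on how workers are launched). The log path under `/tmp` is an arbitrary choice, not something from this report:

```shell
# Ask NCCL to log initialization and error details, as the error message suggests.
export NCCL_DEBUG=INFO
# Optional: restrict logging to the init and network subsystems to reduce noise.
export NCCL_DEBUG_SUBSYS=INIT,NET
# Optional: write one log file per host (%h) and process (%p); path is an assumption.
export NCCL_DEBUG_FILE=/tmp/nccl_debug.%h.%p.log

# Then rerun the repro command above, unchanged, in the same shell.
```

The resulting logs should show which rank and which transport hit the "unhandled system error", which would help confirm whether this is the same failure as #564.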

Labels: bug
