🚀 Scaling GRPO to 70B+ Models and Multi-Node Training with vLLM Server & NCCL Communication #3094


Merged
merged 81 commits, Mar 21, 2025

Conversation

binary-husky
Contributor

@binary-husky binary-husky commented Mar 16, 2025

Warning

The following description is outdated; please refer to the TRL docs.

What does this PR do?

This PR isolates vLLM from the main GRPO training process(es), using only HTTP & NCCL to communicate with a vLLM instance.

By achieving this isolation:

(1) we can easily address almost all issues related to vLLM GPU device placement (simply by setting CUDA_VISIBLE_DEVICES), such as:

(2) we can scale to models of any size without worrying about vLLM (we are free to place vLLM on any machine as long as the training process can reach it over TCP), addressing issues such as:

By the way, I initially came from the open-r1 repo, but the problem obviously cannot be resolved from there.


I have run tests on 32B models (2 accelerate nodes + 1 vLLM node); so far so good.

------------------------------------------------------------------------------------------------------------
1 machine       | 4 GPUs for training, 2 GPUs for vLLM      | using NCCL to deliver param updates
------------------------------------------------------------------------------------------------------------

---
(1) start MAIN TRAINING script:
    (4 GPUs for training)
---
    CUDA_VISIBLE_DEVICES='0,1,2,3' \
    accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
        --num_processes=4 \
        grpo_with_remote_vllm.py \
        --model_name_or_path /mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2___5-7B-Instruct/ \
        --dataset_name "trl-internal-testing/zen" \
        --output_dir './mytests' \
        --bf16 \
        --use_remote_vllm=True --vllm_max_model_len 4096 --remote_vllm_num_gpus=2
---
(2) start the vLLM script (do not run the command line below; it is only a demo, the actual command line will be `printed` by the MAIN TRAINING script):
    (2 GPUs for vLLM)
---
    CUDA_VISIBLE_DEVICES='4,5' \
    REMOTE_VLLM_INIT_MODEL='/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2___5-7B-Instruct/' \
    REMOTE_VLLM_NCCL_LINK=True \
    REMOTE_VLLM_GPUS=2 \
    REMOTE_VLLM_GPU_FRAG=0.9 \
    REMOTE_VLLM_MAX_MODEL_LEN=4096 \
    REMOTE_VLLM_MAX_LORA_RANK=0 \   # <--- never change this, even if you use lora
    REMOTE_VLLM_TEMPERATURE=0.9 REMOTE_VLLM_NUM_GENERATION=8 \
    python3 /your/path/to/trl/extras/remote_vllm_helper.py


------------------------------------------------------------------------------------------------------------
2 machines      | 1 for training, 1 for vLLM      | using NCCL to deliver param updates
------------------------------------------------------------------------------------------------------------

---
(1) start MAIN TRAINING script:
    (on machine 1, all 8 GPUs for training)
---
    CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' \
    accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
        --num_processes=8 \
        grpo_with_remote_vllm.py \
        --model_name_or_path /mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2___5-7B-Instruct/ \
        --dataset_name "trl-internal-testing/zen" \
        --output_dir './mytests' \
        --bf16 \
        --use_remote_vllm=True \
        --vllm_max_model_len 4096 \
        --remote_vllm_num_gpus=1 \
        --remote_vllm_ip_port='22.6.225.225:8000'

---
(2) start the vLLM script (do not run the command line below; it is only a demo, the actual command line will be `printed` by the MAIN TRAINING script):
    (on machine 2, 1 GPU for vLLM)
---
    >> the command line will be `printed` by the MAIN TRAINING script.

Current limitation

@binary-husky binary-husky changed the title 🚀Scaling to 72B+ models by allowing GRPO to connect to VLLM in remote (or local) node with NCCL communication 🚀Scaling to 32B+ models by allowing GRPO to connect to VLLM in remote (or local) node with NCCL communication Mar 16, 2025
@binary-husky
Contributor Author

[image attachment]

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
@qgallouedec
Member

@binary-husky thank you very much for this work. It gave us a better understanding of how to achieve this.

I wanted to take a more ambitious approach and decided to refactor it further. Since this was more than I could reasonably ask of an external contributor, I took the liberty of committing the changes directly to your branch. I hope that's okay with you!

@qgallouedec
Member

qgallouedec commented Mar 18, 2025

# 3094.py
from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

dataset = load_dataset("trl-lib/tldr", split="train")


# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


training_args = GRPOConfig(output_dir="3094", use_vllm=True, bf16=True, gradient_checkpointing=True, logging_steps=10)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=training_args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)
trainer.train()

# Then launch the vLLM server and the training job as two separate processes:
trl vllm-serve --model Qwen/Qwen2.5-7B --tensor_parallel_size 4
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml 3094.py
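
For the multi-node setup discussed in this thread, the trainer can instead be pointed at a vLLM server running on another machine. Below is a minimal sketch assuming the server-related config fields introduced by this PR (vllm_server_host / vllm_server_port); the file name remote_3094.py and the IP are placeholders taken from examples in this thread, so check the TRL docs for the version you have installed.

# remote_3094.py -- hypothetical file name, minimal sketch of a remote-server setup
from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

dataset = load_dataset("trl-lib/tldr", split="train")


# Dummy reward function, as in the example above
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


training_args = GRPOConfig(
    output_dir="3094-remote",
    use_vllm=True,
    vllm_server_host="22.6.222.80",  # machine running `trl vllm-serve` (example IP from this thread)
    vllm_server_port=8000,           # must match the server's --port
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=training_args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)
trainer.train()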

@Andcircle

@binary-husky @qgallouedec

Sorry, I still haven't made this work. How can I use 4 GPUs on machine 1 for vLLM and the remaining 4, plus the whole of machine 2, for training? As stated here:


2 machines | 1 for training, 1 for vLLM | using NCCL to deliver param updates
[…quoted two-machine example from the PR description above, omitted here…]

@qgallouedec
Member

Ignore the PR description, it's an old version. Please refer to the doc.

@Andcircle

Ignore the pr description it's an old version. Please refer to the doc

The doc uses SLURM and only shows how to use a whole node for vLLM. Can we still do something like:
use 4 GPUs on machine 1 for vLLM, and the remaining 4 plus the whole of machine 2 for training?

@binary-husky
Contributor Author

Ignore the pr description it's an old version. Please refer to the doc

The doc uses SLURM and only shows how to use a whole node for vLLM. Can we still do something like: use 4 GPUs on machine 1 for vLLM, and the remaining 4 plus the whole of machine 2 for training?

@Andcircle You can refer to my personal notebook below for training 32B Qwen. It is ugly and not general, but it may convey some basic ideas:

# 1. Move the Model to Memory in all node🌟
# ----------------------------
#  Install rsync       #  apt install rsync tmux -y && \
#  Clear memory disk   #  rm -rf /dev/shm/targetmodel && \
#  Move the model      #  rsync -av /path/to/Qwen2___5-32B-Instruct/ /dev/shm/targetmodel
# ----------------------------

# 2. Machine 1 [eth0: 22.6.222.80] (Few GPUs) Start vLLM Service (Steps 2 and 3 can be done in any order)
#  GPU List 🌟        #    CUDA_VISIBLE_DEVICES="0,1,2,3" \
#  vLLM Serve         #    trl vllm-serve \
#  Model              #    --model /dev/shm/targetmodel \
#  Total GPUs 🌟     #    --tensor_parallel_size 4 \
#                    #    --host 0.0.0.0 --port 8000 \
#                    #    --max_model_len 8192

# 3-1. Machine 2 [eth0: 22.8.150.23] (All GPUs) Start Training Host (Steps 2 and 3 can be done in any order)
#  Change Directory   #     cd /path/to/openr1 && \
#  Virtual Env        #     source .venv/bin/activate && \
#  Clear Terminal     #     clear && \
#  GPU List           #     CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
#                    #     accelerate launch \
#  Multi-Machine Params #     --config_file recipes/accelerate_configs/zero3-multi-nodes.yaml \
#  Number of Machines #     --num_machines=2 \
#  Total GPUs         #     --num_processes=16 \
#  Main IP            #     --main_process_ip="22.8.150.23" \
#  Machine Rank        #     --machine_rank=0 \
#  Target Program      #     src/open_r1/grpo.py \
#  Training Params 🌟 #     --config recipes/Qwen2.5-32B-Instruct/grpo/learn.yaml \
#  VLLM Machine 🌟    #     --vllm_server_host 22.6.222.80

# 3-2. Machine 3 [eth0: 22.6.191.91] (All GPUs) Start Training Host (Steps 2 and 3 can be done in any order)
#  Change Directory   #     cd /path/to/openr1 && \
#  Virtual Env        #     source .venv/bin/activate && \
#  Clear Terminal     #     clear && \
#  GPU List           #     CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
#                    #     accelerate launch \
#  Multi-Machine Params #     --config_file recipes/accelerate_configs/zero3-multi-nodes.yaml \
#  Number of Machines #     --num_machines=2 \
#  Total GPUs         #     --num_processes=16 \
#  Main IP            #     --main_process_ip="22.8.150.23" \
#  Machine Rank 🌟     #     --machine_rank=1 \
#  Target Program      #     src/open_r1/grpo.py \
#  Training Params    #     --config recipes/Qwen2.5-32B-Instruct/grpo/learn.yaml \
#  VLLM Machine       #     --vllm_server_host 22.6.222.80

@Andcircle

@binary-husky awesome! really appreciated!!

@Andcircle

@binary-husky

I'm trying to use the GPUs as efficiently as possible.

In your solution above, on machine 1, GPUs 0,1,2,3 are used for vLLM, so 4,5,6,7 can't be used for training anymore.
I'm trying to start 2 vLLM servers: one on 0,1,2,3 with port 8000, and one on 4,5,6,7 with port 9000.
Then machine 2 would call vLLM 1 and machine 3 would call vLLM 2, so I could train 2 variations of the model at the same time (I thought).

But it doesn't work; the vLLM client update from machine 3 fails with the following error:

Any hints on how I should make this setup work?

[rank0]:     trainer = GRPOTrainer(
[rank0]:               ^^^^^^^^^^^^
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 457, in __init__
[rank0]:     self.vllm_client = VLLMClient(
[rank0]:                        ^^^^^^^^^^^
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/trl/extras/vllm_client.py", line 95, in __init__
[rank0]:     self.init_communicator()
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/trl/extras/vllm_client.py", line 215, in init_communicator
[rank0]:     self.pynccl_comm = PyNcclCommunicator(pg, device="cuda:0")
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
[rank0]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
[rank0]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

@qgallouedec
Member

Maybe the easiest is to use 4 machines? (1 node for training, 1 for vLLM)x2

@jiangix-paper

@binary-husky Great job.
I want to know: if I use containers to start multi-node GRPO, does that mean I can't simply execute the corresponding commands on each node?
Does it look like I have to use SLURM to manage distributed training?

@Andcircle

Andcircle commented Mar 26, 2025

Maybe the easiest is to use 4 machines? (1 node for training, 1 for vLLM)x2

4 GPUs are more than enough for vLLM, which means the remaining 4 are wasted.
Unfortunately we have very limited GPU resources, which is why I'm trying to figure this out, hahaha.
Thanks anyway

@binary-husky
Contributor Author

binary-husky commented Mar 27, 2025

[quoting the question above about running two vLLM servers on ports 8000 and 9000 for two training runs]

Two vLLM servers? There are two ports you need to consider; you probably forgot the other one. Please check for a port conflict ~

[image attachment]

@Andcircle

[quoting the reply above: "Two vLLM servers? There are two ports you need to consider..."]

Yeah, I set different ports through GRPOConfig.
But the error seems to say that the NCCL weight update only supports one vLLM deployment, I guess.

@binary-husky
Contributor Author

binary-husky commented Mar 28, 2025

Yeah, I set different ports through GRPOConfig. But the error seems to say that the NCCL weight update only supports one vLLM deployment, I guess.

@Andcircle Sorry, but the group port is not exposed in GRPOConfig; you have to change it manually in grpo_trainer.py (that 51216 thing).
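
A minimal sketch of that manual edit (the constructor signature is an assumption, not an exposed option): in GRPOTrainer.__init__ the client is created via VLLMClient(...), and the NCCL group port defaults to 51216. Running a second, independent vLLM deployment would mean giving the second training job a different group port there; the exact keyword may differ between TRL versions, so check trl/extras/vllm_client.py.

# Inside trl/trainer/grpo_trainer.py (sketch; keyword names assumed from trl/extras/vllm_client.py)
from trl.extras.vllm_client import VLLMClient

# Training job 1 -> vLLM server on port 8000, default NCCL group port
vllm_client = VLLMClient(host="22.6.222.80", server_port=8000, group_port=51216)

# Training job 2 -> vLLM server on port 9000, a different NCCL group port to avoid the clash
vllm_client = VLLMClient(host="22.6.222.80", server_port=9000, group_port=51217)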

@tingkuanpei

tingkuanpei commented Mar 28, 2025

A 32B model with ZeRO3 and sync_ref_model = true will raise an OOM in SyncRefModelCallback::sync_target_model().

error stack:
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2611, in _inner_training_loop
[rank0]: self.control = self.callback_handler.on_step_end(args, self.state, self.control)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer_callback.py", line 535, in on_step_end
[rank0]: return self.call_event("on_step_end", args, state, control)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer_callback.py", line 557, in call_event
[rank0]: result = getattr(callback, event)(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/apps/dat/nlp/abc/local_exp_git/isa-trl/trl/trainer/callbacks.py", line 132, in on_step_end
[rank0]: self.sync_target_model(model, self.ref_model, args.ref_model_mixup_alpha)
[rank0]: File "/apps/dat/nlp/abc/local_exp_git/isa-trl/trl/trainer/callbacks.py", line 118, in sync_target_model
[rank0]: with deepspeed.zero.GatheredParameters(
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in enter
[rank0]: self.params[0].all_gather(param_list=self.params)
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
[rank0]: return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
[rank0]: self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
[rank0]: flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 79.33 GiB of which 112.00 MiB is free. Process 529718 has 79.18 GiB memory in use. Of the allocated memory 77.62 GiB is allocated by PyTorch, and 114.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

kashif added a commit to kashif/trl that referenced this pull request Mar 28, 2025
…r & NCCL Communication (huggingface#3094)

* 🚀allow GRPO to connect to VLLM in remote/local node with NCCL communication

* Update trl/extras/remote_vllm_helper.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* use argparse for options

* add  imports for remote vllm helper

* formatting

* fix arguments

* use cli options

* vllm serve

* clean server

* better naming

* client

* style

* new params in generate

* this method is the new default

* update config

* do not use asserts

* update config

* separate host and post

* proper deprectation

* deprecated arg in the vllm server

* simplify moving

* document host and port

* style

* update trainer

* new generate args

* update doc

* Fix for zero3

* Better naming

* Remove remote_vllm_helper

* remove grpo_with_remote_vllm

* remove cloudpickle from deps

* Some consistency

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update setup.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* add revision argument to vllm server

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Reset the prefix cache after updating weights

* Update vllm_client.py

* Update vllm_client.py

* Update vllm_serve.py

* Add health check endpoint to vLLM server

* connection timeout

* style

* fix doc langauge hint

* move reset_prefix_cache to its own endpoint

* async

* merge peft adaptor to send to vllm

* Looks simple. Wasn't.

* Peft compatibility

* Update docs/source/speeding_up_training.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/speeding_up_training.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/extras/vllm_client.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* GatheredParameters can be disabled

* gather and ungather peft weights within the same deepseed context

* use is_vllm_available

* minor consistency fixes

* fix error when deepspeed is not installed

* fix deepspeed import when not peft

* simpler

* multinode doc

* minor code and comments changes

* style

* optional deps

* vllm_server_timeout as arg

* small refinement in doc

* update deps

* Fix VLLMClient argument in grpo_trainer; Add zero3+peft vllm transfer solution

* Revert "Fix VLLMClient argument in grpo_trainer; Add zero3+peft vllm transfer solution"

This reverts commit d759c9c.

* log num_tokens

* disable vllm test (in the future we'll add a mock for vllm server for them)

* style

* fix ds3_gather_for_generation

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
@vamshi-rvk

@binary-husky

[quoting the earlier exchange and binary-husky's 32B Qwen notebook from the comment above]

@binary-husky, thanks for this.

I'm trying to fine-tune Llama 405B, and it uses 16 H100s (2 nodes) for vLLM and 8 nodes for training. Can you provide a similar command config that uses 2 nodes for vLLM and the rest for training? Thanks in advance.

@binary-husky
Contributor Author

A 32B model with ZeRO3 and sync_ref_model = true will raise an OOM in SyncRefModelCallback::sync_target_model().

error stack: […see the full traceback in the comment above…]

Use this one as a workaround: #3094 (comment) @tingkuanpei

@binary-husky
Contributor Author

binary-husky commented Mar 31, 2025

@vamshi-rvk sorry, currently I'm unable to allocate that many machines

@tongtong0613

[quoting the earlier exchange and binary-husky's 32B Qwen notebook from the comment above]

@binary-husky Hello, referring to what you shared, I used the first four cards of a single H100 to start the vLLM service, while the other two H100s are used for training. However, I encountered the following error. Do you know how to solve this issue?

[Rank12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=8834, OpType=_ALLGATHER_BASE, NumelIn=1638400, NumelOut=26214400, Timeout(ms)=1800000) ran for 1800055 milliseconds before timing out.
...

@binary-husky
Contributor Author

@tongtong0613 I have seen the 1800055 milliseconds error before, when I messed up a reward function and made rank 0 compute rewards forever. Then the watchdogs on the other ranks become very unhappy...
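
One quick way to confirm this (a diagnostic sketch, not part of TRL) is to wrap each reward function so every rank logs how long reward computation takes; if one rank is much slower than the others, the NCCL watchdog on the remaining ranks will eventually hit the 1800000 ms timeout shown above.

import functools
import os
import time


def timed_reward(reward_fn):
    # Decorator: log the per-rank wall-clock time of a reward function
    @functools.wraps(reward_fn)
    def wrapper(completions, **kwargs):
        rank = int(os.environ.get("RANK", "0"))
        start = time.perf_counter()
        rewards = reward_fn(completions, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"[rank {rank}] {reward_fn.__name__} took {elapsed:.1f}s for {len(completions)} completions")
        return rewards

    return wrapper


@timed_reward
def reward_num_unique_chars(completions, **kwargs):
    # Dummy reward from the example above: number of unique characters per completion
    return [len(set(c)) for c in completions]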

yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025
…r & NCCL Communication (huggingface#3094)

[same squashed commit message as in the commit referenced above]
@wadhwasahil

_move_model_to_remote_vllm - I get OOM with PEFT because all parameters are gathered on a single GPU. However, without PEFT it works fine. Is there a way we can resolve this issue?

@qgallouedec
Member

For PEFT we need to merge the adapter before moving the model to vLLM, but there is currently no way to do that in a distributed manner. So the answer is: for now, there is no solution.
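
For reference, a conceptual sketch of the merge step (plain PEFT API, not TRL's internal code path): the LoRA deltas are folded into the base weights, the merged state dict is extracted, and the adapter is restored so training can continue. Under ZeRO-3 the parameters are sharded, so this only works once the full parameters have been gathered (e.g. inside deepspeed.zero.GatheredParameters), which is exactly the step that runs out of memory on very large models.

from peft import PeftModel


def merged_weights_for_vllm(model: PeftModel) -> dict:
    model.merge_adapter()    # fold the LoRA deltas into the base weights in place
    state_dict = {name: p.detach().cpu() for name, p in model.state_dict().items()}
    model.unmerge_adapter()  # undo the merge so the adapter keeps training
    return state_dict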
