🚀 Scaling GRPO to 70B+ Models and Multi-Node Training with vLLM Server & NCCL Communication #3094


Merged
merged 81 commits, Mar 21, 2025

Conversation

binary-husky
Contributor

@binary-husky binary-husky commented Mar 16, 2025

Warning

The following description is outdated; please refer to the TRL docs.

What does this PR do?

This PR isolates vLLM from the main GRPO training process(es), using only HTTP & NCCL to communicate with a vLLM instance.

By achieving this isolation:

(1) we can easily address almost all issues related to vLLM GPU device placement (simply by setting CUDA_VISIBLE_DEVICES), such as:

(2) we can scale to models of any size without worrying about vLLM (we are free to place vLLM on any machine as long as the training process can reach it over TCP), addressing issues such as:

By the way, I initially came from the open-r1 repo, but the problem obviously cannot be resolved from there.


I have run tests on 32B models (2 accelerate nodes + 1 vLLM node); so far so good.

------------------------------------------------------------------------------------------------------------
1 machine       | 4 GPUs for training, 2 GPUs for vLLM      | using NCCL to deliver param updates
------------------------------------------------------------------------------------------------------------

---
(1) start MAIN TRAINING script:
    (4 GPUs for training)
---
    CUDA_VISIBLE_DEVICES='0,1,2,3' \
    accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
        --num_processes=4 \
        grpo_with_remote_vllm.py \
        --model_name_or_path /mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2___5-7B-Instruct/ \
        --dataset_name "trl-internal-testing/zen" \
        --output_dir './mytests' \
        --bf16 \
        --use_remote_vllm=True --vllm_max_model_len 4096 --remote_vllm_num_gpus=2
---
(2) start the vLLM script (do not run the command line below; it is only a demo, the actual command line will be `printed` by the MAIN TRAINING script):
    (2 GPUs for vLLM)
---
    CUDA_VISIBLE_DEVICES='4,5' \
    REMOTE_VLLM_INIT_MODEL='/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2___5-7B-Instruct/' \
    REMOTE_VLLM_NCCL_LINK=True \
    REMOTE_VLLM_GPUS=2 \
    REMOTE_VLLM_GPU_FRAG=0.9 \
    REMOTE_VLLM_MAX_MODEL_LEN=4096 \
    REMOTE_VLLM_MAX_LORA_RANK=0 \   # <--- never change this, even if you use lora
    REMOTE_VLLM_TEMPERATURE=0.9 REMOTE_VLLM_NUM_GENERATION=8 \
    python3 /your/path/to/trl/extras/remote_vllm_helper.py


------------------------------------------------------------------------------------------------------------
2 machines      | 1 for training, 1 for vLLM      | using NCCL to deliver param updates
------------------------------------------------------------------------------------------------------------

---
(1) start MAIN TRAINING script:
    (on machine 1, all 8 GPUs for training)
---
    CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' \
    accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
        --num_processes=8 \
        grpo_with_remote_vllm.py \
        --model_name_or_path /mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2___5-7B-Instruct/ \
        --dataset_name "trl-internal-testing/zen" \
        --output_dir './mytests' \
        --bf16 \
        --use_remote_vllm=True \
        --vllm_max_model_len 4096 \
        --remote_vllm_num_gpus=1 \
        --remote_vllm_ip_port='22.6.225.225:8000'

---
(2) start the vLLM script (do not run the command line below; it is only a demo, the actual command line will be `printed` by the MAIN TRAINING script):
    (on machine 2, 1 GPU for vLLM)
---
    >> the command line will be `printed` by the MAIN TRAINING script.

Current limitation

@binary-husky binary-husky changed the title 🚀Scaling to 72B+ models by allowing GRPO to connect to VLLM in remote (or local) node with NCCL communication 🚀Scaling to 32B+ models by allowing GRPO to connect to VLLM in remote (or local) node with NCCL communication Mar 16, 2025
@binary-husky
Contributor Author

[image attachment]

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
@qgallouedec
Member

@binary-husky thank you very much for this work. It gave us a better understanding of how to achieve this.

I wanted to take a more ambitious approach and decided to refactor it further. Since this was more than I could reasonably ask of an external contributor, I took the liberty of committing the changes directly to your branch. I hope that's okay with you!

@qgallouedec
Member

qgallouedec commented Mar 18, 2025

# 3094.py
from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

dataset = load_dataset("trl-lib/tldr", split="train")


# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


training_args = GRPOConfig(output_dir="3094", use_vllm=True, bf16=True, gradient_checkpointing=True, logging_steps=10)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=training_args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)
trainer.train()

# Then launch the vLLM server and the training job as two separate processes:
trl vllm-serve --model Qwen/Qwen2.5-7B --tensor_parallel_size 4
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml 3094.py
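
For the multi-node setup discussed in this thread, the trainer can instead be pointed at a vLLM server running on another machine. Below is a minimal sketch assuming the server-related config fields introduced by this PR (vllm_server_host / vllm_server_port); the file name remote_3094.py and the IP are placeholders taken from examples in this thread, so check the TRL docs for the version you have installed.

# remote_3094.py -- hypothetical file name, minimal sketch of a remote-server setup
from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

dataset = load_dataset("trl-lib/tldr", split="train")


# Dummy reward function, as in the example above
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


training_args = GRPOConfig(
    output_dir="3094-remote",
    use_vllm=True,
    vllm_server_host="22.6.222.80",  # machine running `trl vllm-serve` (example IP from this thread)
    vllm_server_port=8000,           # must match the server's --port
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=training_args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)
trainer.train()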

@Andcircle

@binary-husky @qgallouedec

Sorry, I still haven't made this work. How can I use 4 GPUs on machine 1 for vLLM and the remaining 4, plus the whole of machine 2, for training? As stated here:


2 machines | 1 for training, 1 for vLLM | using NCCL to deliver param updates
[…quoted two-machine example from the PR description above, omitted here…]

@qgallouedec
Member

Ignore the PR description, it's an old version. Please refer to the doc.

@Andcircle

Ignore the pr description it's an old version. Please refer to the doc

The doc uses SLURM and only shows how to use a whole node for vLLM. Can we still do something like:
use 4 GPUs on machine 1 for vLLM, and the remaining 4 plus the whole of machine 2 for training?

@binary-husky
Contributor Author

Ignore the pr description it's an old version. Please refer to the doc

The doc uses SLURM and only shows how to use a whole node for vLLM. Can we still do something like: use 4 GPUs on machine 1 for vLLM, and the remaining 4 plus the whole of machine 2 for training?

@Andcircle You can refer to my personal notebook below for training 32B Qwen. It is ugly and not general, but it may convey some basic ideas:

# 1. Move the Model to Memory in all node🌟
# ----------------------------
#  Install rsync       #  apt install rsync tmux -y && \
#  Clear memory disk   #  rm -rf /dev/shm/targetmodel && \
#  Move the model      #  rsync -av /path/to/Qwen2___5-32B-Instruct/ /dev/shm/targetmodel
# ----------------------------

# 2. Machine 1 [eth0: 22.6.222.80] (Few GPUs) Start vLLM Service (Steps 2 and 3 can be done in any order)
#  GPU List 🌟        #    CUDA_VISIBLE_DEVICES="0,1,2,3" \
#  vLLM Serve         #    trl vllm-serve \
#  Model              #    --model /dev/shm/targetmodel \
#  Total GPUs 🌟     #    --tensor_parallel_size 4 \
#                    #    --host 0.0.0.0 --port 8000 \
#                    #    --max_model_len 8192

# 3-1. Machine 2 [eth0: 22.8.150.23] (All GPUs) Start Training Host (Steps 2 and 3 can be done in any order)
#  Change Directory   #     cd /path/to/openr1 && \
#  Virtual Env        #     source .venv/bin/activate && \
#  Clear Terminal     #     clear && \
#  GPU List           #     CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
#                    #     accelerate launch \
#  Multi-Machine Params #     --config_file recipes/accelerate_configs/zero3-multi-nodes.yaml \
#  Number of Machines #     --num_machines=2 \
#  Total GPUs         #     --num_processes=16 \
#  Main IP            #     --main_process_ip="22.8.150.23" \
#  Machine Rank        #     --machine_rank=0 \
#  Target Program      #     src/open_r1/grpo.py \
#  Training Params 🌟 #     --config recipes/Qwen2.5-32B-Instruct/grpo/learn.yaml \
#  VLLM Machine 🌟    #     --vllm_server_host 22.6.222.80

# 3-2. Machine 3 [eth0: 22.6.191.91] (All GPUs) Start Training Host (Steps 2 and 3 can be done in any order)
#  Change Directory   #     cd /path/to/openr1 && \
#  Virtual Env        #     source .venv/bin/activate && \
#  Clear Terminal     #     clear && \
#  GPU List           #     CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
#                    #     accelerate launch \
#  Multi-Machine Params #     --config_file recipes/accelerate_configs/zero3-multi-nodes.yaml \
#  Number of Machines #     --num_machines=2 \
#  Total GPUs         #     --num_processes=16 \
#  Main IP            #     --main_process_ip="22.8.150.23" \
#  Machine Rank 🌟     #     --machine_rank=1 \
#  Target Program      #     src/open_r1/grpo.py \
#  Training Params    #     --config recipes/Qwen2.5-32B-Instruct/grpo/learn.yaml \
#  VLLM Machine       #     --vllm_server_host 22.6.222.80

@Andcircle

@binary-husky awesome! really appreciated!!

@Andcircle

@binary-husky

I'm trying to use the GPUs as efficiently as possible.

In your solution above, on machine 1, GPUs 0,1,2,3 are used for vLLM, so 4,5,6,7 can't be used for training anymore.
I'm trying to start 2 vLLM servers: one on 0,1,2,3 with port 8000, and one on 4,5,6,7 with port 9000.
Then machine 2 would call vLLM 1 and machine 3 would call vLLM 2, so I could train 2 variations of the model at the same time (I thought).

But it doesn't work; the vLLM client update from machine 3 fails with the following error:

Any hints on how I should make this setup work?

[rank0]:     trainer = GRPOTrainer(
[rank0]:               ^^^^^^^^^^^^
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 457, in __init__
[rank0]:     self.vllm_client = VLLMClient(
[rank0]:                        ^^^^^^^^^^^
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/trl/extras/vllm_client.py", line 95, in __init__
[rank0]:     self.init_communicator()
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/trl/extras/vllm_client.py", line 215, in init_communicator
[rank0]:     self.pynccl_comm = PyNcclCommunicator(pg, device="cuda:0")
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
[rank0]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
[rank0]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank0]:   File "/home/user/.local/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

@qgallouedec
Member

Maybe the easiest is to use 4 machines? (1 node for training, 1 for vLLM)x2

@jiangix-paper

@binary-husky Great job.
I want to know: if I use containers to start multi-node GRPO, does that mean I can't simply execute the corresponding commands on each node?
Does it look like I have to use SLURM to manage distributed training?

@Andcircle

Andcircle commented Mar 26, 2025

Maybe the easiest is to use 4 machines? (1 node for training, 1 for vLLM)x2

4 GPUs are more than enough for vLLM, which means the remaining 4 are wasted.
Unfortunately we have very limited GPU resources, which is why I'm trying to figure this out, hahaha.
Thanks anyway

@binary-husky
Contributor Author

binary-husky commented Mar 27, 2025

[quoting the question above about running two vLLM servers on ports 8000 and 9000 for two training runs]

Two vLLM servers? There are two ports you need to consider; you probably forgot the other one. Please check for a port conflict ~

[image attachment]

@Andcircle

[quoting the reply above: "Two vLLM servers? There are two ports you need to consider..."]

Yeah, I set different ports through GRPOConfig.
But the error seems to say that the NCCL weight update only supports one vLLM deployment, I guess.

@binary-husky
Contributor Author

binary-husky commented Mar 28, 2025

Yeah, I set different ports through GRPOConfig. But the error seems to say that the NCCL weight update only supports one vLLM deployment, I guess.

@Andcircle Sorry, but the group port is not exposed in GRPOConfig; you have to change it manually in grpo_trainer.py (that 51216 thing).
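
A minimal sketch of that manual edit (the constructor signature is an assumption, not an exposed option): in GRPOTrainer.__init__ the client is created via VLLMClient(...), and the NCCL group port defaults to 51216. Running a second, independent vLLM deployment would mean giving the second training job a different group port there; the exact keyword may differ between TRL versions, so check trl/extras/vllm_client.py.

# Inside trl/trainer/grpo_trainer.py (sketch; keyword names assumed from trl/extras/vllm_client.py)
from trl.extras.vllm_client import VLLMClient

# Training job 1 -> vLLM server on port 8000, default NCCL group port
vllm_client = VLLMClient(host="22.6.222.80", server_port=8000, group_port=51216)

# Training job 2 -> vLLM server on port 9000, a different NCCL group port to avoid the clash
vllm_client = VLLMClient(host="22.6.222.80", server_port=9000, group_port=51217)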

@tingkuanpei

tingkuanpei commented Mar 28, 2025

A 32B model with ZeRO3 and sync_ref_model = true will raise an OOM in SyncRefModelCallback::sync_target_model().

error stack:
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2611, in _inner_training_loop
[rank0]: self.control = self.callback_handler.on_step_end(args, self.state, self.control)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer_callback.py", line 535, in on_step_end
[rank0]: return self.call_event("on_step_end", args, state, control)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer_callback.py", line 557, in call_event
[rank0]: result = getattr(callback, event)(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/apps/dat/nlp/abc/local_exp_git/isa-trl/trl/trainer/callbacks.py", line 132, in on_step_end
[rank0]: self.sync_target_model(model, self.ref_model, args.ref_model_mixup_alpha)
[rank0]: File "/apps/dat/nlp/abc/local_exp_git/isa-trl/trl/trainer/callbacks.py", line 118, in sync_target_model
[rank0]: with deepspeed.zero.GatheredParameters(
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in enter
[rank0]: self.params[0].all_gather(param_list=self.params)
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
[rank0]: return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
[rank0]: self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank0]: File "/usr/local/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
[rank0]: flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 79.33 GiB of which 112.00 MiB is free. Process 529718 has 79.18 GiB memory in use. Of the allocated memory 77.62 GiB is allocated by PyTorch, and 114.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

kashif added a commit to kashif/trl that referenced this pull request Mar 28, 2025
…r & NCCL Communication (huggingface#3094)

* 🚀allow GRPO to connect to VLLM in remote/local node with NCCL communication

* Update trl/extras/remote_vllm_helper.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* use argparse for options

* add  imports for remote vllm helper

* formatting

* fix arguments

* use cli options

* vllm serve

* clean server

* better naming

* client

* style

* new params in generate

* this method is the new default

* update config

* do not use asserts

* update config

* separate host and post

* proper deprectation

* deprecated arg in the vllm server

* simplify moving

* document host and port

* style

* update trainer

* new generate args

* update doc

* Fix for zero3

* Better naming

* Remove remote_vllm_helper

* remove grpo_with_remote_vllm

* remove cloudpickle from deps

* Some consistency

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update setup.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* add revision argument to vllm server

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Reset the prefix cache after updating weights

* Update vllm_client.py

* Update vllm_client.py

* Update vllm_serve.py

* Add health check endpoint to vLLM server

* connection timeout

* style

* fix doc langauge hint

* move reset_prefix_cache to its own endpoint

* async

* merge peft adaptor to send to vllm

* Looks simple. Wasn't.

* Peft compatibility

* Update docs/source/speeding_up_training.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/speeding_up_training.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/extras/vllm_client.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* GatheredParameters can be disabled

* gather and ungather peft weights within the same deepseed context

* use is_vllm_available

* minor consistency fixes

* fix error when deepspeed is not installed

* fix deepspeed import when not peft

* simpler

* multinode doc

* minor code and comments changes

* style

* optional deps

* vllm_server_timeout as arg

* small refinement in doc

* update deps

* Fix VLLMClient argument in grpo_trainer; Add zero3+peft vllm transfer solution

* Revert "Fix VLLMClient argument in grpo_trainer; Add zero3+peft vllm transfer solution"

This reverts commit d759c9c.

* log num_tokens

* disable vllm test (in the future we'll add a mock for vllm server for them)

* style

* fix ds3_gather_for_generation

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
@vamshi-rvk

@binary-husky

[quoting the earlier exchange and binary-husky's 32B Qwen notebook from the comment above]

@binary-husky, thanks for this.

I'm trying to fine-tune Llama 405B, and it uses 16 H100s (2 nodes) for vLLM and 8 nodes for training. Can you provide a similar command config that uses 2 nodes for vLLM and the rest for training? Thanks in advance.

@binary-husky
Contributor Author

A 32B model with ZeRO3 and sync_ref_model = true will raise an OOM in SyncRefModelCallback::sync_target_model().

error stack: […see the full traceback in the comment above…]

Use this one as a workaround: #3094 (comment) @tingkuanpei

@binary-husky
Contributor Author

binary-husky commented Mar 31, 2025

@vamshi-rvk sorry, currently I'm unable to allocate that many machines

@tongtong0613

[quoting the earlier exchange and binary-husky's 32B Qwen notebook from the comment above]

@binary-husky Hello, referring to what you shared, I used the first four cards of a single H100 to start the vLLM service, while the other two H100s are used for training. However, I encountered the following error. Do you know how to solve this issue?

[Rank12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=8834, OpType=_ALLGATHER_BASE, NumelIn=1638400, NumelOut=26214400, Timeout(ms)=1800000) ran for 1800055 milliseconds before timing out.
...

@binary-husky
Contributor Author

@tongtong0613 I have seen the 1800055 milliseconds error before, when I messed up a reward function and made rank 0 compute rewards forever. Then the watchdogs on the other ranks become very unhappy...
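
One quick way to confirm this (a diagnostic sketch, not part of TRL) is to wrap each reward function so every rank logs how long reward computation takes; if one rank is much slower than the others, the NCCL watchdog on the remaining ranks will eventually hit the 1800000 ms timeout shown above.

import functools
import os
import time


def timed_reward(reward_fn):
    # Decorator: log the per-rank wall-clock time of a reward function
    @functools.wraps(reward_fn)
    def wrapper(completions, **kwargs):
        rank = int(os.environ.get("RANK", "0"))
        start = time.perf_counter()
        rewards = reward_fn(completions, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"[rank {rank}] {reward_fn.__name__} took {elapsed:.1f}s for {len(completions)} completions")
        return rewards

    return wrapper


@timed_reward
def reward_num_unique_chars(completions, **kwargs):
    # Dummy reward from the example above: number of unique characters per completion
    return [len(set(c)) for c in completions]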

yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025
…r & NCCL Communication (huggingface#3094)

[same squashed commit message as in the commit referenced above]
@wadhwasahil

_move_model_to_remote_vllm - I get OOM with PEFT because all parameters are gathered on a single GPU. However, without PEFT it works fine. Is there a way we can resolve this issue?

@qgallouedec
Member

For PEFT we need to merge the adapter before moving the model to vLLM, but there is currently no way to do that in a distributed manner. So the answer is: for now, there is no solution.
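
For reference, a conceptual sketch of the merge step (plain PEFT API, not TRL's internal code path): the LoRA deltas are folded into the base weights, the merged state dict is extracted, and the adapter is restored so training can continue. Under ZeRO-3 the parameters are sharded, so this only works once the full parameters have been gathered (e.g. inside deepspeed.zero.GatheredParameters), which is exactly the step that runs out of memory on very large models.

from peft import PeftModel


def merged_weights_for_vllm(model: PeftModel) -> dict:
    model.merge_adapter()    # fold the LoRA deltas into the base weights in place
    state_dict = {name: p.detach().cpu() for name, p in model.state_dict().items()}
    model.unmerge_adapter()  # undo the merge so the adapter keeps training
    return state_dict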
