
πŸ§‘β€πŸ€β€πŸ§‘ Co-Locating vLLM w/ training to for higher throughput and GPU utilization #3394


Merged

merged 22 commits into huggingface:main on May 1, 2025

Conversation

toslali-ibm
Contributor

@toslali-ibm toslali-ibm commented Apr 30, 2025

What does this PR do?

Enables colocating vLLM with training on each GPU to improve utilization and throughput.

Fixes #3064 and #3113
Addresses: #3195, #2971, #2922, #2887 etc.

Enabler:

vLLM (version >0.7.3) introduced support for an external launcher, allowing vLLM processes to run alongside other workloads on the same GPU.
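
For illustration, a minimal sketch of what this enables (not the PR's code; the model name and values are taken from the test config below): each training process builds its own in-process vLLM engine with the external launcher backend, so the engine shares the GPU and the existing torch.distributed setup instead of running as a separate server.

from vllm import LLM, SamplingParams

# assumes accelerate/torchrun has already launched the processes and set up the process group
llm = LLM(
    model="Qwen/Qwen2.5-Math-1.5B",
    distributed_executor_backend="external_launcher",  # run vLLM inside this training process
    tensor_parallel_size=2,               # shard the engine across the ranks of one TP group
    gpu_memory_utilization=0.3,           # leave the rest of the GPU memory to training
    max_model_len=2048,
)
outputs = llm.generate(["1 + 1 = ?"], SamplingParams(max_tokens=64), use_tqdm=False)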

Benefits:

  • Faster inference: speeds up GRPO training by reducing inference latency through parallel prompt processing (each vLLM instance handles its own device's batch).
  • Better GPU efficiency: frees GPU resources by removing the need for a dedicated vLLM server; multiple vLLM instances can now share GPUs with training jobs, reducing GPU idle time.
  • Supports tensor parallelism (TP) and data parallelism (DP).
  • Ray-less solution.

Testing vLLM colocation

Run it w/ the following:
VLLM_USE_V1=0 ACCELERATE_LOG_LEVEL=info CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --config_file recipes/accelerate_configs/zero3.yaml --num_processes=8 -m open_r1.grpo --config config_tpcoloc.yaml

  • Set vllm_colocation in the config to the sharding you would like (a small sketch of the resulting groups follows this list).
  • E.g., if vllm_colocation=1, the model is not sharded; each GPU holds a full copy of the model.
  • If vllm_colocation=2, the model is sharded across pairs of GPUs, giving TP groups [0,1], [2,3], [4,5], [6,7].
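
As a sketch of the grouping rule (plain Python, not code from the PR), ranks are grouped into consecutive blocks of size vllm_colocation:

world_size = 8
vllm_colocation = 2  # TP group size

for rank in range(world_size):
    group_index = rank // vllm_colocation
    tp_group = list(range(group_index * vllm_colocation, (group_index + 1) * vllm_colocation))
    print(rank, tp_group)  # 0 -> [0, 1], 1 -> [0, 1], 2 -> [2, 3], ...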
Click to view config.yaml
# Model arguments
model_name_or_path: Qwen/Qwen2.5-Math-1.5B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: DigitalLearningGmbH/MATH-lighteval
dataset_config: default
dataset_prompt_column: problem
system_prompt: "You are a helpful AI Assistant, designed to provided well-reasoned and detailed responses. You FIRST think about the reasoning process as an internal monologue and then provide the user with the answer. The reasoning process MUST BE enclosed within <think> and </think> tags."

# GRPO trainer config
bf16: true
use_vllm: true
vllm_colocation: 2
vllm_gpu_memory_utilization: 0.3
vllm_max_model_len: 2048
do_eval: false
gradient_accumulation_steps: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 3.0e-06
log_completions: false
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine
max_prompt_length: 512
max_completion_length: 1024
max_steps: 50
num_generations: 8
num_train_epochs: 1
overwrite_output_dir: true
# per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: false
report_to:
- wandb
reward_funcs:
- accuracy
- format
reward_weights:
- 1.0
- 1.0
save_strategy: steps
save_steps: 100
save_total_limit: 1
seed: 42
warmup_ratio: 0.1
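
Note that because vLLM and the trainer share each GPU in this setup, vllm_gpu_memory_utilization has to leave headroom for the training job; the 0.3 above caps vLLM's weights and KV cache at roughly 30% of device memory.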

Sanity check

  • GRPO results for Qwen/Qwen2.5-Math-1.5B on the DigitalLearningGmbH/MATH-lighteval dataset (config above), using both plain TRL (w/ vLLM server) and colocated TRL (w/ TP=1, TP=2, and TP=4); the rewards are identical.

[Figure: reward curves for vLLM-server TRL vs. colocated TRL (TP=1/2/4)]

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

New version of #3162 (incorporated @qgallouedec 's comments)

CC @fabianlim

@toslali-ibm toslali-ibm changed the title from "Tpnosleep" to "Co-Locating vLLM w/ training for higher throughput and GPU utilization" on Apr 30, 2025
@toslali-ibm toslali-ibm marked this pull request as ready for review April 30, 2025 18:19
@qgallouedec
Member

@toslali-ibm I've updated your PR by changing the logic a bit, so I'll let you have a look, test it out, and tell me what you think.

@toslali-ibm
Contributor Author

@toslali-ibm I've updated your PR by changing the logic a bit, so I'll let you have a look, test it out, and tell me what you think.

I think it looks good. Let me run a sanity-check experiment.

# gather the prompts from every rank in this TP group so that all shards see the same batch
torch.distributed.all_gather_object(gathered_prompts, prompts_text, group=self.tp_group)
prompts_text = [p for sublist in gathered_prompts for p in sublist]

# every rank of the TP group calls generate() on the identical, concatenated prompt list
all_outputs = self.llm.generate(prompts_text, sampling_params=sampling_params, use_tqdm=False)
Member

In a multi-GPU setup (2 GPUs) with TP=2, suppose each rank is given a separate subset of prompts, e.g., rank 0 gets ["a", "b"] and rank 1 gets ["c", "d"]. Does each rank then independently call:

llm.generate(["a", "b", "c", "d"])

It seems like a duplicated call; is it coordinated such that each rank only processes its subset of prompts? In other words, if the full prompt list is passed on each rank, does vLLM handle this duplication internally to avoid redundant work?

Contributor Author

@toslali-ibm toslali-ibm May 1, 2025

Yes. When using TP along with external_launcher, we need to make sure that all participating shards receive the same prompts -- and vLLM handles the deduplication internally.

So if TP = 2 and GPUs = 2, both workers get ["a", "b", "c", "d"].

If TP = 1 and GPUs = 2, the first worker gets ["a", "b"] and the second worker gets ["c", "d"].
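
A minimal sketch of that pattern (variable names are illustrative, not the PR's): each rank gathers prompts only within its TP group, so every shard of a group passes the identical concatenated list to generate(), while different TP groups work on different prompts.

import os
import torch.distributed as dist

# assumes the process group was already initialized by accelerate/torchrun
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
tp_size = 2  # vllm_colocation

# every rank must call new_group() for every group, but only keeps its own
tp_group = None
for start in range(0, world_size, tp_size):
    ranks = list(range(start, start + tp_size))
    group = dist.new_group(ranks)
    if rank in ranks:
        tp_group = group

local_prompts = ["a", "b"] if rank % tp_size == 0 else ["c", "d"]
gathered = [None] * tp_size
dist.all_gather_object(gathered, local_prompts, group=tp_group)
tp_prompts = [p for sub in gathered for p in sub]  # identical on every rank of the group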

@toslali-ibm
Contributor Author

I am getting an error from the current version of the code

  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/open-r1/src/open_r1/grpo.py", line 179, in <module>
    main(script_args, training_args, model_args)
  File "/workspace/open-r1/src/open_r1/grpo.py", line 133, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2238, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2553, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 3724, in training_step
    inputs = self._prepare_inputs(inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/trl/trl/extras/profiling.py", line 87, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/trl/trl/trainer/grpo_trainer.py", line 991, in _prepare_inputs
    accumulated_local_batch = self._generate_and_score_completions(accumulated_local_batch)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/trl/trl/trainer/grpo_trainer.py", line 1094, in _generate_and_score_completions
    completion_ids = [torch.tensor(ids, device=device) for ids in completion_ids]

@qgallouedec
Member

Any idea why?

@toslali-ibm
Contributor Author

Any idea why?

There was a mismatch between the config and the trainer (colocate vs. colocation). I fixed that; now there is another error I am debugging:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/open-r1/src/open_r1/grpo.py", line 179, in <module>
    main(script_args, training_args, model_args)
  File "/workspace/open-r1/src/open_r1/grpo.py", line 133, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2238, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2553, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 3724, in training_step
    inputs = self._prepare_inputs(inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/trl/trl/extras/profiling.py", line 87, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/trl/trl/trainer/grpo_trainer.py", line 991, in _prepare_inputs
    accumulated_local_batch = self._generate_and_score_completions(accumulated_local_batch)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/trl/trl/trainer/grpo_trainer.py", line 1024, in _generate_and_score_completions
    self._move_model_to_vllm()
  File "/workspace/trl/trl/extras/profiling.py", line 87, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/trl/trl/trainer/grpo_trainer.py", line 961, in _move_model_to_vllm
    llm_model = self.llm.llm_engine.model_executor.driver_worker.model
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 624, in __getattr__
    return getattr(self.worker, attr)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Worker' object has no attribute 'model'
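
The diff later in this PR (see _sync_fsdp_params_to_vllm below) points at the fix: in this vLLM version the colocated worker exposes the model through its model_runner rather than directly as a worker attribute. A one-line sketch, assuming the same attribute path as in that diff:

# colocate mode: reach the in-process model via the driver worker's model_runner
llm_model = self.llm.llm_engine.model_executor.driver_worker.model_runner.model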

@toslali-ibm
Contributor Author

Okay... I think it is all fixed now. I was able to run a quick training. I am now running the sanity-check experiment for TP = 1, 2, 4 and will report the rewards.

@toslali-ibm
Contributor Author

toslali-ibm commented May 1, 2025

@qgallouedec , the sanity experiment looks good - please see the figure below.
Labels: Refactored (current version of the code) vs. vLLM coloc (before Quentin's refactor) vs. vLLM server

[Figure: W&B reward chart (5/1/2025) comparing the Refactored, vLLM coloc, and vLLM server runs]

@qgallouedec
Member

Nice! Trying to run on my side.

* self.args.gradient_accumulation_steps,
max_model_len=self.max_prompt_length + self.max_completion_length,
distributed_executor_backend="external_launcher",
# Feed identical seed for tp groups to ensure ...
Member

@toslali-ibm I wasn't sure how to motivate this, can you complete this comment?

Member

btw, is os.getenv("RANK", "0") the same as self.accelerator.process_index? If so, I'd use the latter.

Member

@qgallouedec qgallouedec left a comment

Some changes not related to this PR, but that I made while merging main into your branch.

Comment on lines -512 to +527
# redirect the model.module forward to the model forward to ensure pre-forward hooks are called
self._forward_redirection = _ForwardRedirection()
if self.use_liger_loss:
if not is_liger_kernel_available():
raise ImportError(
"Liger is required to use `liger_loss` as the GRPO loss. Run `pip install liger-kernel`."
)

# Redirect the model.module forward to the model forward to ensure pre-forward hooks are called
self._forward_redirection = _ForwardRedirection()

Member

not related to this PR

Comment on lines +868 to +895
def _sync_fsdp_params_to_vllm(self, module: nn.Module, prefix: str = "", visited=None):
    """Memory-efficient post-order traversal of FSDP modules to extract full parameters and sync with vLLM."""
    if visited is None:
        visited = set()

    for child_name, child_module in module.named_children():
        child_prefix = f"{prefix}.{child_name}" if prefix else child_name
        self._sync_fsdp_params_to_vllm(
            child_module, prefix=child_prefix, visited=visited
        )  # recurse into the child

    if isinstance(module, FSDP):
        with FSDP.summon_full_params(module, recurse=False, writeback=False):
            for param_name, param in module.named_parameters():
                full_name = f"{prefix}.{param_name}" if prefix else param_name
                for extra in ("_fsdp_wrapped_module.", "_checkpoint_wrapped_module."):
                    full_name = full_name.replace(extra, "")

                if full_name in visited:
                    continue  # skip FSDP subtrees already traversed
                visited.add(full_name)

                if self.vllm_mode == "server" and self.accelerator.is_main_process:
                    self.vllm_client.update_named_param(full_name, param.data)
                elif self.vllm_mode == "colocate":
                    llm_model = self.llm.llm_engine.model_executor.driver_worker.model_runner.model
                    llm_model.load_weights([(full_name, param.data)])

Member

not related to this PR

Comment on lines -876 to +921
# With PEFT and FSDP/DeepSpeed ZeRO Stage 3, we must gather the full model at once before merging, as merging
# adapters in a sharded manner is not supported.
# With PEFT and FSDP/DeepSpeed ZeRO Stage 3, we must gather the full model at once before merging, as
# merging adapters in a sharded manner is not supported.
# TODO: does this work with FSDP?
with gather_if_zero3(list(self.model.parameters())):
if self.is_fsdp_enabled:
self.model.merge_adapter()

# Update vLLM weights while parameters are gathered
if self.is_fsdp_enabled: # note if using FSDP, gather_if_zero3 is nullcontext
# Update vLLM weights while parameters are gathered
# For PEFT with FSDP we need to use the memory efficient post-order traversal
self.model.merge_adapter()
post_order_fsdp_processing(self.model)
self.model.unmerge_adapter()
self._sync_fsdp_params_to_vllm(self.model)
else:
# DeepSpeed ZeRO-3 with PEFT (not using FSDP)
self.model.merge_adapter()
# DeepSpeed ZeRO-3 with PEFT
Member

not related to this PR

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@qgallouedec qgallouedec left a comment

Wonderful work @toslali-ibm
Can you please take a final look before I merge?

@qgallouedec qgallouedec changed the title from "Co-Locating vLLM w/ training for higher throughput and GPU utilization" to "🧑‍🤝‍🧑 Co-Locating vLLM w/ training for higher throughput and GPU utilization" on May 1, 2025
@toslali-ibm
Contributor Author

toslali-ibm commented May 1, 2025

Wonderful work @toslali-ibm Can you please take a final look before I merge?

Everything looks great; my sanity experiment ran successfully on the latest version. Thanks so much for the solid help, @qgallouedec ! :)

@qgallouedec qgallouedec merged commit 18596cf into huggingface:main May 1, 2025
@mayanks43

Very cool improvement!

Successfully merging this pull request may close these issues.

Enable External Launcher Support for vLLM in TRL for Efficient GRPO Training