🧑‍🤝‍🧑 Co-Locating vLLM w/ training for higher throughput and GPU utilization #3394
Conversation
@toslali-ibm I've updated your PR by changing the logic a bit, so I'll let you have a look, test it out, and tell me what you think.
I think it looks good. Let me run a sanity-check experiment.
trl/trainer/grpo_trainer.py
Outdated
```python
torch.distributed.all_gather_object(gathered_prompts, prompts_text, group=self.tp_group)
prompts_text = [p for sublist in gathered_prompts for p in sublist]

all_outputs = self.llm.generate(prompts_text, sampling_params=sampling_params, use_tqdm=False)
```
In a multi-GPU setup (2 GPUs) with TP=2, if each rank is given a separate subset of prompts, e.g., rank 0 gets `["a", "b"]` and rank 1 gets `["c", "d"]`, does each rank independently call `llm.generate(["a", "b", "c", "d"])`? That looks like a duplicated call, but is it coordinated so that each rank only processes its subset of prompts? In other words, if the full prompt list is passed on each rank, does vLLM handle this duplication internally to avoid redundant work?
Yes. When using TP along with external_launcher, we need to make sure that all participating shards receive the same prompts -- vLLM handles the rest internally.
- If TP = 2 and GPUs = 2, all workers get `["a", "b", "c", "d"]`.
- If TP = 1 and GPUs = 2, the first worker gets `["a", "b"]` and the second worker gets `["c", "d"]`.
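To illustrate the coordination, here is a minimal sketch of the gathering step shown in the diff above; `tp_group` and `tp_size` are assumed to be the process group and size of the current TP group (set up elsewhere, e.g. via `torch.distributed.new_group`):

```python
import torch.distributed as dist

def gather_prompts_for_tp(prompts_text: list[str], tp_group, tp_size: int) -> list[str]:
    # Each TP rank starts with its own shard of prompts, e.g. ["a", "b"] on rank 0
    # and ["c", "d"] on rank 1 of the same TP group.
    gathered = [None for _ in range(tp_size)]
    dist.all_gather_object(gathered, prompts_text, group=tp_group)
    # Flatten so every member of the TP group ends up with the identical full list,
    # e.g. ["a", "b", "c", "d"]; vLLM then coordinates the sharded work internally.
    return [p for sublist in gathered for p in sublist]
```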
I am getting an error from the current version of the code. Any idea why?
There was a mismatch between config and trainer (colocate vs. colocation). I fixed that; now there is another error I am debugging.
Okay... I think it's all fixed now. I was able to run a quick training. I am now running a sanity-check experiment for TP = 1, 2, 4 and will report rewards.
@qgallouedec, the sanity experiment looks good - please see the figure below.
Nice! Trying to run on my side.
trl/trainer/grpo_trainer.py
Outdated
```python
* self.args.gradient_accumulation_steps,
max_model_len=self.max_prompt_length + self.max_completion_length,
distributed_executor_backend="external_launcher",
# Feed identical seed for tp groups to ensure ...
```
@toslali-ibm I wasn't sure how to motivate this, can you complete this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, is `os.getenv("RANK", "0")` the same as `self.accelerator.process_index`? If so, I'd use the latter.
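For reference, a small hedged sketch of the equivalence and of one way an identical per-TP-group seed could be derived (assuming a torchrun/accelerate launch where the `RANK` env var is set per process; `tp_group_seed` is a hypothetical helper, not the PR's code):

```python
import os

# Under torchrun/accelerate, RANK holds the global rank of the current process,
# which is expected to match self.accelerator.process_index.
global_rank = int(os.getenv("RANK", "0"))

def tp_group_seed(global_rank: int, tp_size: int, base_seed: int = 0) -> int:
    # Ranks in the same TP group (e.g. [0, 1] when tp_size=2) derive the same seed,
    # so all shards of one vLLM engine sample identically for the shared prompt list.
    return base_seed + global_rank // tp_size
```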
Some changes not related to this PR, but that I made while merging main into your branch:
```python
# Redirect the model.module forward to the model forward to ensure pre-forward hooks are called
self._forward_redirection = _ForwardRedirection()

if self.use_liger_loss:
    if not is_liger_kernel_available():
        raise ImportError(
            "Liger is required to use `liger_loss` as the GRPO loss. Run `pip install liger-kernel`."
        )
```
not related to this PR
```python
def _sync_fsdp_params_to_vllm(self, module: nn.Module, prefix: str = "", visited=None):
    """Memory-efficient post-order traversal of FSDP modules to extract full parameters and sync with vLLM."""
    if visited is None:
        visited = set()

    for child_name, child_module in module.named_children():
        child_prefix = f"{prefix}.{child_name}" if prefix else child_name
        self._sync_fsdp_params_to_vllm(
            child_module, prefix=child_prefix, visited=visited
        )  # recurse into the child

    if isinstance(module, FSDP):
        with FSDP.summon_full_params(module, recurse=False, writeback=False):
            for param_name, param in module.named_parameters():
                full_name = f"{prefix}.{param_name}" if prefix else param_name
                for extra in ("_fsdp_wrapped_module.", "_checkpoint_wrapped_module."):
                    full_name = full_name.replace(extra, "")

                if full_name in visited:
                    continue  # skip FSDP subtrees already traversed
                visited.add(full_name)

                if self.vllm_mode == "server" and self.accelerator.is_main_process:
                    self.vllm_client.update_named_param(full_name, param.data)
                elif self.vllm_mode == "colocate":
                    llm_model = self.llm.llm_engine.model_executor.driver_worker.model_runner.model
                    llm_model.load_weights([(full_name, param.data)])
```
not related to this PR
```python
# With PEFT and FSDP/DeepSpeed ZeRO Stage 3, we must gather the full model at once before merging, as
# merging adapters in a sharded manner is not supported.
# TODO: does this work with FSDP?
with gather_if_zero3(list(self.model.parameters())):
    if self.is_fsdp_enabled:
        self.model.merge_adapter()

    # Update vLLM weights while parameters are gathered
    if self.is_fsdp_enabled:  # note: if using FSDP, gather_if_zero3 is nullcontext
        # For PEFT with FSDP we need to use the memory-efficient post-order traversal
        self.model.merge_adapter()
        post_order_fsdp_processing(self.model)
        self.model.unmerge_adapter()
        self._sync_fsdp_params_to_vllm(self.model)
    else:
        # DeepSpeed ZeRO-3 with PEFT (not using FSDP)
        self.model.merge_adapter()
```
not related to this PR
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Wonderful work @toslali-ibm! Can you please take a final look before I merge?
Everything looks great: my sanity experiment ran successfully on the latest version. Thanks so much for the solid help, @qgallouedec! :)
Very cool improvement!
What does this PR do?
Enables colocating vLLM with training on each GPU to improve utilization and throughput.
Fixes #3064 and #3113
Addresses: #3195, #2971, #2922, #2887 etc.
Enabler:
vLLM (version >0.7.3) introduced support for an external launcher, allowing vLLM processes to run alongside other workloads on the same GPU.
Benefits: higher GPU utilization and throughput, since no GPUs need to be reserved exclusively for vLLM generation.
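As a rough illustration of the enabler described above (a hedged sketch, not the PR's exact code; the model name, TP size, and memory fraction below are placeholders):

```python
from vllm import LLM, SamplingParams

# Instantiate vLLM inside each already-launched training process so it shares the GPU
# with training; "external_launcher" tells vLLM to reuse the existing torchrun/accelerate
# processes instead of spawning its own workers.
llm = LLM(
    model="Qwen/Qwen2.5-Math-1.5B",              # placeholder model
    tensor_parallel_size=2,                      # e.g. vllm_colocation=2
    distributed_executor_backend="external_launcher",
    gpu_memory_utilization=0.3,                  # leave the rest of the GPU for training
)
outputs = llm.generate(["a", "b", "c", "d"], SamplingParams(max_tokens=16), use_tqdm=False)
```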
Testing vLLM colocation
Run it w/ the following:
```shell
VLLM_USE_V1=0 ACCELERATE_LOG_LEVEL=info CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --config_file recipes/accelerate_configs/zero3.yaml --num_processes=8 -m open_r1.grpo --config config_tpcoloc.yaml
```
Set `vllm_colocation` in the config to the sharding you would like (see the sketch after this list):
- With `vllm_colocation=1`, the model is not sharded; each GPU holds a full copy of the model.
- With `vllm_colocation=2`, the model is sharded across two GPUs, with TP groups [0,1], [2,3], [4,5], [6,7].
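A small sketch of how 8 GPUs map to TP groups for a given `vllm_colocation` value (assuming contiguous grouping, as in the example above):

```python
def tp_groups(world_size: int, vllm_colocation: int) -> list[list[int]]:
    # vllm_colocation is the TP degree: 1 -> every GPU holds a full model copy,
    # 2 -> pairs of GPUs shard one model, and so on.
    return [
        list(range(start, start + vllm_colocation))
        for start in range(0, world_size, vllm_colocation)
    ]

print(tp_groups(8, 1))  # [[0], [1], [2], [3], [4], [5], [6], [7]]
print(tp_groups(8, 2))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```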
Sanity check
Trained Qwen/Qwen2.5-Math-1.5B on the DigitalLearningGmbH/MATH-lighteval dataset (as shown above) using both plain TRL (w/ vLLM server) and colocated TRL (w/ TP=1, TP=2, and TP=4); the rewards are identical.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
New version of #3162 (incorporates @qgallouedec's comments).
CC @fabianlim