Reproduction
Question: The VRAM on the vLLM device suddenly increases when the training model updates its parameters.
Description:
I'm fine-tuning the R1-32b-int4 model with the TRL library on 2×A100 (40 GB). I've implemented QLoRA support for GRPO using code from unsloth-zoo: one card (cuda:0) trains, the other (cuda:1) generates the data with vLLM, and the two cards only exchange the LoRA parameters. This lets me train GRPO on the two A100s.
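For context, the setup looks roughly like the sketch below. The model path, dataset, and reward function are placeholders, and the QLoRA patching from unsloth-zoo is omitted; the important part is that generation is delegated to vLLM on cuda:1 while training stays on cuda:0:

```python
# Rough sketch of the setup (placeholders, not my exact script).
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Placeholder reward: prefer shorter completions.
    return [-float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-r1-32b-int4",
    per_device_train_batch_size=1,
    use_vllm=True,                    # generation runs in a colocated vLLM engine
    vllm_device="cuda:1",             # keep vLLM off the training GPU
    vllm_gpu_memory_utilization=0.5,
)

trainer = GRPOTrainer(
    model="path/to/R1-32b-int4",      # placeholder model path
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder dataset
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```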
When both models are loaded for training, each card's VRAM usage stays within its 40 GB limit; neither one shows an OOM.
When the model parameters are updated during training, I noticed that VRAM usage rises on both cards. That shouldn't happen: only cuda:0 should rise, while cuda:1 should stay unchanged.
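To make the numbers concrete, I log allocated memory per device around the update step with a small debugging helper like the one below (my own helper, not part of TRL):

```python
import torch

def log_vram(tag: str) -> None:
    # Print allocated/reserved memory for every visible GPU, so the jump
    # on cuda:1 during the parameter update shows up in the logs.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f"[{tag}] cuda:{i} allocated={alloc:.2f} GiB reserved={reserved:.2f} GiB")
```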
At first I suspected move_model_vllm, but a breakpoint showed that this function had not been called yet when the OOM occurred, so the increase in cuda:1's VRAM is not caused by pushing cuda:0's updated parameters to vLLM.
Further debugging showed that the model used inside self._get_per_token_logps is wrapped as a DataParallel model with device ids 0 and 1.
This makes the model use vLLM's device whenever it computes log-probs or updates parameters.
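This can be confirmed from a breakpoint right before the log-prob computation; the check below is just what I inspected in the debugger:

```python
import torch.nn as nn

# Inside the training step: the model handed to _get_per_token_logps
# is a DataParallel wrapper spanning both GPUs.
if isinstance(model, nn.DataParallel):
    print("wrapped:", type(model).__name__, "device_ids:", model.device_ids)
    # -> wrapped: DataParallel device_ids: [0, 1]
```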
Tracking this further, I found that in transformers' Trainer, if n_gpu is greater than 1 the model is automatically wrapped in nn.DataParallel, which includes all visible GPUs:
if self.args.n_gpu > 1 and not getattr(model, "is_loaded_in_8bit", False):
model = nn.DataParallel(model)
outputs:
Traceback (most recent call last):
File "example.py", line 42, in <module>
...
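As a temporary workaround I force the Trainer to believe there is only one training GPU, so the DataParallel wrapping is skipped. This pokes at a private TrainingArguments field, so it is only a sketch of what worked for me locally, not a proposed fix:

```python
from trl import GRPOConfig

# Hacky workaround: make the HF Trainer see a single training GPU so the
# `self.args.n_gpu > 1` branch that wraps the model in nn.DataParallel is skipped.
# `_n_gpu` is a private TrainingArguments field and may not survive version bumps.
training_args = GRPOConfig(
    output_dir="grpo-r1-32b-int4",
    use_vllm=True,
    vllm_device="cuda:1",
)
training_args._n_gpu = 1  # training stays on cuda:0; cuda:1 is left to vLLM

# then build GRPOTrainer with training_args exactly as in the sketch above
```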
System Info
Copy-paste the following information when reporting an issue:
- Platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.31
- Python version: 3.10.14
- TRL version: 0.16.0.dev0+b55d9f0
- PyTorch version: 2.5.1
- CUDA device(s): NVIDIA A100-SXM4-40GB, NVIDIA A100-SXM4-40GB
- Transformers version: 4.48.3
- Accelerate version: 1.4.0
- Accelerate config: not found
- Datasets version: 3.0.1
- HF Hub version: 0.29.1
- bitsandbytes version: 0.45.3
- DeepSpeed version: not installed
- Diffusers version: 0.32.2
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.65.1
- PEFT version: 0.14.0
- vLLM version: 0.7.3
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
- Any traceback provided is complete