
OOM 8xH100 using latest GRPO code with vLLM #2688

@abacaj

Description


Reproduction

Model is 8B.

Training works fine with DeepSpeed when vLLM is disabled. As soon as I enable vLLM alongside DeepSpeed, I get an OOM on the vLLM device while the model is loading:

INFO 01-30 05:50:12 model_runner.py:1115] Loading model weights took 0.0000 GB
INFO 01-30 05:50:16 worker.py:266] Memory profiling takes 3.26 seconds
INFO 01-30 05:50:16 worker.py:266] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.90) = 71.19GiB
INFO 01-30 05:50:16 worker.py:266] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 71.19GiB.
INFO 01-30 05:50:16 executor_base.py:108] # CUDA blocks: 39874, # CPU blocks: 2240
INFO 01-30 05:50:16 executor_base.py:113] Maximum concurrency for 131072 tokens per request: 4.87x
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ab/grpo/train.py", line 173, in <module>
[rank0]:     trainer = GRPOTrainer(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 314, in __init__
[rank0]:     self.llm = LLM(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/utils.py", line 1039, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 239, in __init__
[rank0]:     self.llm_engine = self.engine_class.from_engine_args(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 482, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 274, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 427, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 119, in initialize_cache
[rank0]:     self.collective_rpc("initialize_cache",
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/utils.py", line 2208, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/worker.py", line 308, in initialize_cache
[rank0]:     self._init_cache_engine()
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/worker.py", line 313, in _init_cache_engine
[rank0]:     self.cache_engine = [
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/worker.py", line 314, in <listcomp>
[rank0]:     CacheEngine(self.cache_config, self.model_config,
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 62, in __init__
[rank0]:     self.gpu_cache = self._allocate_kv_cache(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 81, in _allocate_kv_cache
[rank0]:     torch.zeros(kv_cache_shape,
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.74 GiB. GPU 6 has a total capacity of 79.10 GiB of which 961.88 MiB is free. Including non-PyTorch memory, this process has 78.15 GiB memory in use. Of the allocated memory 77.42 GiB is allocated by PyTorch, and 74.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
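For context, the profiling log above shows the budget vLLM is trying to reserve: total_gpu_memory (79.10 GiB) x gpu_memory_utilization (0.90) ≈ 71.19 GiB, and since model weights are reported as 0 GiB, essentially all of that goes to KV-cache blocks. The allocation then fails because GPU 6 already has ~78 GiB in use, presumably from a DeepSpeed training rank sharing that device. Below is a minimal, unverified sketch of how one might work around this, assuming the GRPOConfig fields `use_vllm`, `vllm_device`, and `vllm_gpu_memory_utilization` exposed in TRL ~0.14: pin generation to the GPU left free by the 7 training processes and/or lower the memory fraction. The model name, dataset, and reward function are placeholders, not taken from this report.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy dataset and reward function, only to make the sketch self-contained.
train_dataset = Dataset.from_dict({"prompt": ["Write a haiku about GPUs."]})

def reward_len(completions, **kwargs):
    # Dummy reward: prefer shorter completions.
    return [-float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-8b",
    use_vllm=True,
    # With 7 training ranks on GPUs 0-6, pin vLLM to the remaining GPU.
    vllm_device="cuda:7",
    # 0.90 * 79.10 GiB ≈ 71.19 GiB would otherwise be reserved for the KV cache;
    # a smaller fraction shrinks that reservation if the device is shared.
    vllm_gpu_memory_utilization=0.5,
    # A shorter completion length also reduces the per-request KV-cache footprint.
    max_completion_length=1024,
    bf16=True,
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder 8B model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```

Whether lowering `vllm_gpu_memory_utilization` alone is enough depends on how much memory the training ranks already hold on the device vLLM lands on; keeping vLLM on a dedicated GPU is the safer of the two knobs.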

System Info

- Platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version: 2.5.1
- CUDA device(s): NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3
- Transformers version: 4.48.1
- Accelerate version: 1.3.0
- Accelerate config: 
  - compute_environment: LOCAL_MACHINE
  - distributed_type: DEEPSPEED
  - use_cpu: False
  - debug: False
  - num_processes: 7
  - machine_rank: 0
  - num_machines: 1
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - deepspeed_config: {'deepspeed_config_file': 'configs/deepspeed.json', 'zero3_init_flag': False}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- Datasets version: 3.2.0
- HF Hub version: 0.28.0
- TRL version: 0.14.0.dev0
- bitsandbytes version: not installed
- DeepSpeed version: 0.16.3
- Diffusers version: not installed
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.60.2
- PEFT version: 0.9.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete
