
OOM 8xH100 using latest GRPO code with vLLM #2688

@abacaj

Description


Reproduction

Model is 8B.

Training works fine with DeepSpeed when vLLM is disabled. As soon as I enable vLLM alongside DeepSpeed, I get an OOM on the vLLM device while the model is loading:

INFO 01-30 05:50:12 model_runner.py:1115] Loading model weights took 0.0000 GB
INFO 01-30 05:50:16 worker.py:266] Memory profiling takes 3.26 seconds
INFO 01-30 05:50:16 worker.py:266] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.90) = 71.19GiB
INFO 01-30 05:50:16 worker.py:266] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 71.19GiB.
INFO 01-30 05:50:16 executor_base.py:108] # CUDA blocks: 39874, # CPU blocks: 2240
INFO 01-30 05:50:16 executor_base.py:113] Maximum concurrency for 131072 tokens per request: 4.87x
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ab/grpo/train.py", line 173, in <module>
[rank0]:     trainer = GRPOTrainer(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 314, in __init__
[rank0]:     self.llm = LLM(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/utils.py", line 1039, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 239, in __init__
[rank0]:     self.llm_engine = self.engine_class.from_engine_args(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 482, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 274, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 427, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 119, in initialize_cache
[rank0]:     self.collective_rpc("initialize_cache",
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/utils.py", line 2208, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/worker.py", line 308, in initialize_cache
[rank0]:     self._init_cache_engine()
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/worker.py", line 313, in _init_cache_engine
[rank0]:     self.cache_engine = [
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/worker.py", line 314, in <listcomp>
[rank0]:     CacheEngine(self.cache_config, self.model_config,
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 62, in __init__
[rank0]:     self.gpu_cache = self._allocate_kv_cache(
[rank0]:   File "/home/ab/grpo/env/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 81, in _allocate_kv_cache
[rank0]:     torch.zeros(kv_cache_shape,
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.74 GiB. GPU 6 has a total capacity of 79.10 GiB of which 961.88 MiB is free. Including non-PyTorch memory, this process has 78.15 GiB memory in use. Of the allocated memory 77.42 GiB is allocated by PyTorch, and 74.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
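For context, the profiling log above shows the budget vLLM is trying to reserve: total_gpu_memory (79.10 GiB) x gpu_memory_utilization (0.90) ≈ 71.19 GiB, and since model weights are reported as 0 GiB, essentially all of that goes to KV-cache blocks. The allocation then fails because GPU 6 already has ~78 GiB in use, presumably from a DeepSpeed training rank sharing that device. Below is a minimal, unverified sketch of how one might work around this, assuming the GRPOConfig fields `use_vllm`, `vllm_device`, and `vllm_gpu_memory_utilization` exposed in TRL ~0.14: pin generation to the GPU left free by the 7 training processes and/or lower the memory fraction. The model name, dataset, and reward function are placeholders, not taken from this report.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy dataset and reward function, only to make the sketch self-contained.
train_dataset = Dataset.from_dict({"prompt": ["Write a haiku about GPUs."]})

def reward_len(completions, **kwargs):
    # Dummy reward: prefer shorter completions.
    return [-float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-8b",
    use_vllm=True,
    # With 7 training ranks on GPUs 0-6, pin vLLM to the remaining GPU.
    vllm_device="cuda:7",
    # 0.90 * 79.10 GiB ≈ 71.19 GiB would otherwise be reserved for the KV cache;
    # a smaller fraction shrinks that reservation if the device is shared.
    vllm_gpu_memory_utilization=0.5,
    # A shorter completion length also reduces the per-request KV-cache footprint.
    max_completion_length=1024,
    bf16=True,
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder 8B model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```

Whether lowering `vllm_gpu_memory_utilization` alone is enough depends on how much memory the training ranks already hold on the device vLLM lands on; keeping vLLM on a dedicated GPU is the safer of the two knobs.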

System Info

- Platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version: 2.5.1
- CUDA device(s): NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3
- Transformers version: 4.48.1
- Accelerate version: 1.3.0
- Accelerate config: 
  - compute_environment: LOCAL_MACHINE
  - distributed_type: DEEPSPEED
  - use_cpu: False
  - debug: False
  - num_processes: 7
  - machine_rank: 0
  - num_machines: 1
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - deepspeed_config: {'deepspeed_config_file': 'configs/deepspeed.json', 'zero3_init_flag': False}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- Datasets version: 3.2.0
- HF Hub version: 0.28.0
- TRL version: 0.14.0.dev0
- bitsandbytes version: not installed
- DeepSpeed version: 0.16.3
- Diffusers version: not installed
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.60.2
- PEFT version: 0.9.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete
