⭐ Add vllm_gpu_memory_utilization recommendation script #3554
Conversation
Hi @toslali-ibm, thanks for this addition! I ran

```
(trl) shirin_yamani@ip-26-0-163-58:/fsx/shirin/trl/scripts$ python recommend_gpu_util.py --model_config https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B/resolve/main/config.json --exp_config ./experiment.yaml
```

with the below experiment.yaml:

```yaml
per_device_train_batch_size: 4
max_prompt_length: 1024
max_completion_length: 256
vllm_data_parallel_size: 4
```

output:
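The recommendation script itself is not reproduced in this thread. As a rough illustration only, a minimal sketch of how such an `--exp_config` YAML could be consumed (assuming PyYAML; the actual `scripts/recommend_gpu_mem_util.py` in this PR may be wired differently):

```python
# Hypothetical sketch of the CLI wiring -- not the actual script from this PR.
import argparse

import yaml  # assumes PyYAML is installed

parser = argparse.ArgumentParser()
parser.add_argument("--model_config", required=True, help="URL of the model's config.json")
parser.add_argument("--exp_config", required=True, help="path to experiment.yaml")
args = parser.parse_args()

# Read the experiment settings used by the estimate (see the Gradio script later in the thread).
with open(args.exp_config) as f:
    exp = yaml.safe_load(f)

batch_size = exp["per_device_train_batch_size"]
seq_len = exp["max_prompt_length"] + exp["max_completion_length"]
parallel_size = exp.get("vllm_data_parallel_size", 1)
print(batch_size, seq_len, parallel_size, args.model_config)
```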
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
This is great; thanks a lot @shirinyamani for trying it out. I added your instructions to the PR description.
Nice @toslali-ibm! I think it would be better to have this as a Space (are you familiar with Gradio?) that we can embed in the doc. WDYT?
OK, I created a Gradio script. Please see it below and run it locally (e.g. `python recommend.py`).

Click to expand `recommend.py`:

```python
import math

import gradio as gr
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM


def recommend_gpu_mem_util(
    model_config_url,
    batch_size,
    max_prompt_length,
    max_completion_length,
    tp_size,
    gpu_memory=79,
    precision_in_bytes=2,
    kv_multiplier=2,
):
    # Load model config from HF URL
    try:
        config = AutoConfig.from_pretrained(model_config_url)
    except Exception as e:
        msg = f"Failed to load model config from URL: {e}"
        return msg, {"Error": msg}

    # Extract model config params
    try:
        num_hidden_layers = getattr(config, "num_hidden_layers")
        hidden_size = getattr(config, "hidden_size")
        num_attention_heads = getattr(config, "num_attention_heads")
        num_key_value_heads = getattr(config, "num_key_value_heads", num_attention_heads)
    except Exception as e:
        msg = f"Required field missing in model config: {e}"
        return msg, {"Error": msg}

    # Estimate model no. of parameters
    try:
        with init_empty_weights():
            model = AutoModelForCausalLM.from_config(config)
        num_params = sum(p.numel() for p in model.parameters())
        model_params = num_params / 1e9
        est_msg = f"Estimated model_params from config: {model_params:.2f}B"
    except Exception as e:
        msg = f"Failed to estimate model parameters: {e}"
        return msg, {"Error": msg}

    # Calculate all memory and utilization values
    try:
        seq_len = max_prompt_length + max_completion_length
        model_size = float(model_params) * 1024**3 * precision_in_bytes / tp_size
        # KV_cache_per_token = kv_multiplier (K and V) * num_hidden_layers
        #   * (num_key_value_heads * hidden_size / num_attention_heads) * precision_in_bytes
        kv_cache_per_token = (
            kv_multiplier
            * num_hidden_layers
            * (num_key_value_heads * hidden_size / num_attention_heads)
            * precision_in_bytes
        )
        # KV_cache_total = KV_cache_per_token * batch_size * seq_len (max_prompt_length + max_completion_length)
        kv_cache_total = kv_cache_per_token * batch_size * seq_len
        # Buffer = (Model + KV_cache) * 0.2  # generous 20% buffer
        buffer_size = 0.2 * (model_size + kv_cache_total)
        # Total = Model + KV_cache + Buffer
        total_required = model_size + kv_cache_total + buffer_size
        # GPU utilization = Total_required / Total_gpu
        gpu_memory_bytes = float(gpu_memory) * 1024**3
        gpu_utilization_ratio = total_required / gpu_memory_bytes
        # Round up to the nearest 0.05 - this generous estimate works much better than the exact prediction!
        rounded_utilization = math.ceil(gpu_utilization_ratio * 20) / 20 + 0.05
        main_result = f"vllm_gpu_memory_utilization = {rounded_utilization:.2f}"
        ans = {
            "KV_cache_per_token_MB": kv_cache_per_token / 1024**2,
            "KV_cache_total_GB": kv_cache_total / 1024**3,
            "Model_size_GB": model_size / 1024**3,
            "Buffer_GB": buffer_size / 1024**3,
            "Total_required_GB": total_required / 1024**3,
            "GPU_mem_util": gpu_utilization_ratio,
            "GPU_mem_util_recommended": rounded_utilization,
            "model_params": est_msg,
            "num_hidden_layers": num_hidden_layers,
            "hidden_size": hidden_size,
            "num_attention_heads": num_attention_heads,
            "num_key_value_heads": num_key_value_heads,
        }
        return main_result, ans
    except Exception as e:
        msg = f"Error during calculation: {e}"
        return msg, {"Error": msg}


iface = gr.Interface(
    fn=recommend_gpu_mem_util,
    inputs=[
        gr.Textbox(label="Model Config URL (HuggingFace)", value="https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/resolve/main/config.json"),
        gr.Number(label="per_device_train_batch_size", value=4),
        gr.Number(label="max_prompt_length", value=512),
        gr.Number(label="max_completion_length", value=512),
        gr.Number(label="vllm_tensor_parallel_size (tp_size)", value=1),
        gr.Number(label="GPU Memory (GB)", value=79),
        gr.Number(label="Precision in Bytes (e.g., 2)", value=2),
        gr.Number(label="KV Multiplier", value=2),
    ],
    outputs=[
        gr.Textbox(label="Recommended vLLM GPU Memory Utilization"),
        gr.JSON(label="Calculation Details"),
    ],
    title="vLLM GRPO GPU Memory Utilization Estimator",
    description="""
Paste your HuggingFace model config URL (ending in config.json) and enter your experiment details.
Model parameters are automatically extracted and estimated from the config.

Note: this is a general recommendation and may not be optimal for your specific environment.
Always verify your actual training GPU requirements. For example, if you are using DeepSpeed, consider its memory estimation tool:
https://deepspeed.readthedocs.io/en/latest/memory.html

If you encounter "not enough memory" errors, try increasing the GPU memory utilization setting.
If you experience out-of-memory (OOM) errors, lower the utilization value and/or reduce your batch size.
""",
    allow_flagging="never",
)

if __name__ == "__main__":
    iface.launch()
```
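If you prefer to skip the UI, the same function can be imported and called directly. A minimal sketch, assuming the script above is saved as `recommend.py` in the working directory (the argument values are just examples):

```python
# Assumes the Gradio script above is saved as recommend.py next to this file.
from recommend import recommend_gpu_mem_util

result, details = recommend_gpu_mem_util(
    model_config_url="https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/resolve/main/config.json",
    batch_size=4,
    max_prompt_length=512,
    max_completion_length=512,
    tp_size=1,
    gpu_memory=79,          # GB per GPU
    precision_in_bytes=2,   # bf16/fp16 weights and KV cache
    kv_multiplier=2,        # K and V
)
print(result)                            # "vllm_gpu_memory_utilization = ..."
print(details["Total_required_GB"])      # estimated total memory needed in GB
```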
Nice!
Great, thanks @qgallouedec! I cannot access the link provided (https://huggingface.co/spaces/trl-lib/recommend-vllm-memory/settings?embed=true). Is your recommendation that I remove scripts/recommend.py and instead update the GRPO docs to embed the Space?
@toslali-ibm, I've just granted you write access on https://huggingface.co/trl-lib/. Can you see the embed helper now? |
Yes |
How does this look, @qgallouedec?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
For some reason it doesn't work with the script embed; let's try with an iframe.
vllm_gpu_memory_utilization recommendation script (…e#3554)

Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
What does this PR do?
This PR introduces the `/scripts/recommend_gpu_mem_util.py` script to help estimate the recommended GPU memory utilization based on the model configuration and experiment settings.

How to use it: run the script with a model `config.json` URL and an `experiment.yaml`, as in the example reproduced below; the script then prints the recommendation as output.
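For reference, the invocation and `experiment.yaml` tried earlier in this thread (the thread runs the script as `recommend_gpu_util.py` from inside `scripts/`):

```sh
python recommend_gpu_util.py --model_config https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B/resolve/main/config.json --exp_config ./experiment.yaml
```

```yaml
per_device_train_batch_size: 4
max_prompt_length: 1024
max_completion_length: 256
vllm_data_parallel_size: 4
```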
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
CC @qgallouedec and @fabianlim