
Conversation

@toslali-ibm (Contributor) commented Jun 9, 2025

What does this PR do?

This PR introduces the /scripts/recommend_gpu_mem_util.py script to help estimate the recommended vLLM GPU memory utilization (vllm_gpu_memory_utilization) based on the model configuration and experiment settings.

How to use it:

python recommend_gpu_util.py --model_config https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B/resolve/main/config.json --exp_config ./experiment.yaml

Below is the experiment.yaml:

per_device_train_batch_size: 4
max_prompt_length: 1024
max_completion_length: 256
vllm_data_parallel_size: 4 

output:

Estimated model_params from config: 8.19B
KV_cache_per_token_MB: 0.14
KV_cache_total_GB: 0.70
Model_size_GB: 16.38
Buffer_GB: 3.42
Total_required_GB: 20.50
GPU_mem_util: 0.26
GPU_mem_util_recommended: 0.30
-------
Recommended vLLM GPU memory utilization: 0.30
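
For reference, here is a minimal sketch of the arithmetic behind these numbers. The config values used below (36 hidden layers, hidden size 4096, 32 attention heads, 8 KV heads) and the 79 GB GPU size are illustrative assumptions, not part of the script output:

# Minimal sketch of the estimate above; config values and GPU size are assumptions.
num_hidden_layers = 36        # assumed from the model config
hidden_size = 4096            # assumed
num_attention_heads = 32      # assumed
num_key_value_heads = 8       # assumed
precision_in_bytes = 2        # bf16/fp16
model_params_b = 8.19         # billions, as estimated by the script

batch_size = 4
seq_len = 1024 + 256          # max_prompt_length + max_completion_length

# K and V for every layer, sized by the KV-head projection width
kv_per_token = 2 * num_hidden_layers * (
    num_key_value_heads * hidden_size / num_attention_heads
) * precision_in_bytes                                        # ~0.14 MB
kv_total = kv_per_token * batch_size * seq_len                # ~0.70 GB

model_size = model_params_b * 1024**3 * precision_in_bytes    # ~16.38 GB
buffer = 0.2 * (model_size + kv_total)                        # ~3.42 GB
total = model_size + kv_total + buffer                        # ~20.50 GB

print(total / (79 * 1024**3))   # ~0.26, rounded up to the recommended 0.30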

Note: the link (path to the model config) has to be the direct downloadable resolve URL (e.g. https://huggingface.co/<org>/<model>/resolve/main/config.json), not the regular file page -- see the screenshot below.

[Screenshot: the downloadable (resolve) link for config.json on the Hub]

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

CC @qgallouedec and @fabianlim

@toslali-ibm marked this pull request as ready for review June 9, 2025 13:43
@shirinyamani (Member) commented Jun 9, 2025

Hi @toslali-ibm, thanks for this addition!
A couple of notes/comments from testing your PR:

  1. Testing with the command below (we can add this to the PR description):
(trl) shirin_yamani@ip-26-0-163-58:/fsx/shirin/trl/scripts$ python recommend_gpu_util.py --model_config https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B/resolve/main/config.json --exp_config ./experiment.yaml

and the following experiment.yaml:

per_device_train_batch_size: 4
max_prompt_length: 1024
max_completion_length: 256
vllm_data_parallel_size: 4 

output:

/fsx/shirin/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/utils/hub.py:600: FutureWarning: Using `from_pretrained` with the url of a file (here https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B/resolve/main/config.json) is deprecated and won't be possible anymore in v5 of Transformers. You should host your file on the Hub (hf.co) instead and use the repository ID. Note that this is not compatible with the caching system (your file will be downloaded at each execution) or multiple processes (each process will download the file in a different temporary file).
  warnings.warn(
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 859/859 [00:00<00:00, 10.1MB/s]
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}
Estimated model_params from config: 8.19B
KV_cache_per_token_MB: 0.14
KV_cache_total_GB: 0.70
Model_size_GB: 16.38
Buffer_GB: 3.42
Total_required_GB: 20.50
GPU_mem_util: 0.26
GPU_mem_util_recommended: 0.30
-------
Recommended vLLM GPU memory utilization: 0.30
  2. The link (path to the model config) has to be the direct downloadable (resolve) URL -- see the screenshot below.

[Screenshot: the downloadable (resolve) link for config.json on the Hub]

toslali-ibm and others added 3 commits June 10, 2025 08:42
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
@toslali-ibm (Contributor, Author)

  1. Testing with the command below (we can add this to the PR description)

This is great; thanks a lot @shirinyamani for trying it out. I added your instructions to the PR description.

@qgallouedec (Member)

Nice @toslali-ibm

I think it would be better to have this as a Space (are you familiar with Gradio?) that we can embed in the doc. WDYT?

@toslali-ibm (Contributor, Author)

Nice @toslali-ibm

I think it would be better to have this as a Space (are you familiar with Gradio?) that we can embed in the doc. WDYT?

OK, I created a Gradio script. Please see it below and run it with python recommend.py. Also, where do you deploy/serve it?

Click to expand `recommend.py`!
import math
import gradio as gr
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights

def recommend_gpu_mem_util(
    model_config_url,
    batch_size,
    max_prompt_length,
    max_completion_length,
    tp_size,
    gpu_memory=79,
    precision_in_bytes=2,
    kv_multiplier=2
):
    # Load model config from HF URL
    try:
        config = AutoConfig.from_pretrained(model_config_url)
    except Exception as e:
        msg = f"Failed to load model config from URL: {e}"
        return msg, {"Error": msg}

    # Extract model config params
    try:
        num_hidden_layers = getattr(config, "num_hidden_layers")
        hidden_size = getattr(config, "hidden_size")
        num_attention_heads = getattr(config, "num_attention_heads")
        num_key_value_heads = getattr(config, "num_key_value_heads", num_attention_heads)
    except Exception as e:
        msg = f"Required field missing in model config: {e}"
        return msg, {"Error": msg}

    # Estimate the number of model parameters
    try:
        with init_empty_weights():
            model = AutoModelForCausalLM.from_config(config)
        num_params = sum(p.numel() for p in model.parameters())
        model_params = num_params / 1e9
        est_msg = f"Estimated model_params from config: {model_params:.2f}B"
    except Exception as e:
        msg = f"Failed to estimate model parameters: {e}"
        return msg, {"Error": msg}

    # Calculate all memory and utilization values
    try:
        seq_len = max_prompt_length + max_completion_length

        model_size = float(model_params) * 1024**3 * precision_in_bytes / tp_size
        
        # KV_cache_per_token = kv_multiplier (K and V) * num_hidden_layers * (num_key_value_heads * hidden_size / num_attention_heads) * precision_in_bytes
        kv_cache_per_token = (
            kv_multiplier
            * num_hidden_layers
            * (num_key_value_heads * hidden_size / num_attention_heads)
            * precision_in_bytes
        )
        # KV_cache_total = KV_cache_per_token * Batch_size * Seq_len (max_prompt_length + max_completion_length)
        kv_cache_total = kv_cache_per_token * batch_size * seq_len
        # Buffer = (Model + KV_cache) * 0.2  # generous 20% buffer
        buffer_size = 0.2 * (model_size + kv_cache_total)
        # Total = Model + KV_cache + Buffer
        total_required = model_size + kv_cache_total + buffer_size
        # GPU utilization = Total_reqd / Total_gpu
        gpu_memory_bytes = float(gpu_memory) * 1024**3
        gpu_utilization_ratio = total_required / gpu_memory_bytes
        # Round up to the nearest 0.05, then add an extra 0.05 safety margin -
        # this generous estimate works much better than the raw prediction
        rounded_utilization = math.ceil(gpu_utilization_ratio * 20) / 20 + 0.05

        main_result = f"vllm_gpu_memory_utilization = {rounded_utilization:.2f}"
        ans = {
            "KV_cache_per_token_MB": kv_cache_per_token / 1024**2,
            "KV_cache_total_GB": kv_cache_total / 1024**3,
            "Model_size_GB": model_size / 1024**3,
            "Buffer_GB": buffer_size / 1024**3,
            "Total_required_GB": total_required / 1024**3,
            "GPU_mem_util": gpu_utilization_ratio,
            "GPU_mem_util_recommended": rounded_utilization,
            "model_params": est_msg,
            "num_hidden_layers": num_hidden_layers,
            "hidden_size": hidden_size,
            "num_attention_heads": num_attention_heads,
            "num_key_value_heads": num_key_value_heads,
        }

        return main_result, ans
    except Exception as e:
        msg = f"Error during calculation: {e}"
        return msg, {"Error": msg}

iface = gr.Interface(
    fn=recommend_gpu_mem_util,
    inputs=[
        gr.Textbox(label="Model Config URL (HuggingFace)", value="https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/resolve/main/config.json"),
        gr.Number(label="per_device_train_batch_size", value=4),
        gr.Number(label="max_prompt_length", value=512),
        gr.Number(label="max_completion_length", value=512),
        gr.Number(label="vllm_tensor_parallel_size (tp_size)", value=1),
        gr.Number(label="GPU Memory (GB)", value=79),
        gr.Number(label="Precision in Bytes (e.g., 2)", value=2),
        gr.Number(label="KV Multiplier", value=2),
    ],
    outputs=[
        gr.Textbox(label="Recommended vLLM GPU Memory Utilization"),
        gr.JSON(label="Calculation Details"),
    ],
    title="vLLM GRPO GPU Memory Utilization Estimator",
    description = """
    Paste your HuggingFace model config URL (ending in config.json), and enter experiment details. 
    Model parameters are automatically extracted and estimated from the config.

    Note: This is a general recommendation and may not be optimal for your specific environment.
    Always verify your actual training GPU requirements. For example, if you're using DeepSpeed, consider utilizing their memory estimation tool:
    https://deepspeed.readthedocs.io/en/latest/memory.html

    If you encounter "not enough memory" errors, try increasing the GPU memory utilization setting.
    If you experience out-of-memory (OOM) errors, lower the utilization value and/or reduce your batch size.
    """,
    allow_flagging="never"
)

if __name__ == "__main__":
    iface.launch()
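
For a quick sanity check, the function can also be called directly without launching the UI; below is a small usage sketch with the same defaults as the interface above:

# Quick programmatic sanity check, using the interface defaults above.
result, details = recommend_gpu_mem_util(
    model_config_url="https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/resolve/main/config.json",
    batch_size=4,
    max_prompt_length=512,
    max_completion_length=512,
    tp_size=1,
)
print(result)    # e.g. "vllm_gpu_memory_utilization = ..."
print(details)   # per-component breakdown (KV cache, model size, buffer, ...)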

@qgallouedec (Member)

Nice!
Deployed here: https://huggingface.co/spaces/trl-lib/recommend-vllm-memory
I think you can directly embed the space in the GRPO doc, see https://huggingface.co/spaces/trl-lib/recommend-vllm-memory/settings?embed=true

@toslali-ibm (Contributor, Author)

Nice! Deployed here: https://huggingface.co/spaces/trl-lib/recommend-vllm-memory I think you can directly embed the space in the GRPO doc, see https://huggingface.co/spaces/trl-lib/recommend-vllm-memory/settings?embed=true

Great, thanks @qgallouedec! I cannot access the link provided (https://huggingface.co/spaces/trl-lib/recommend-vllm-memory/settings?embed=true). Is your recommendation that I remove scripts/recommend.py and instead update GRPO docs to embed the space?

@qgallouedec (Member)

@toslali-ibm, I've just granted you write access on https://huggingface.co/trl-lib/. Can you see the embed helper now?

@qgallouedec (Member) commented Jun 15, 2025

Is your recommendation that I remove scripts/recommend.py and instead update GRPO docs to embed the space?

Yes

@toslali-ibm (Contributor, Author)

How does this look, @qgallouedec?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec (Member)

For some reason it doesn't work with the script embed, so let's try an iframe.

@qgallouedec (Member)

Nice!!

[Screenshot: the Space embedded in the GRPO docs]

@qgallouedec changed the title from "Add vllm_gpu_memory_utilization recommendation script" to "⭐ Add vllm_gpu_memory_utilization recommendation script" Jun 19, 2025
@qgallouedec merged commit 8bad863 into huggingface:main Jun 19, 2025
1 check passed
marcandrelarochelle pushed a commit to marcandrelarochelle/trl that referenced this pull request Jul 29, 2025
…e#3554)

Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>