[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine #20062
Conversation
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
Summary of Changes

Hello @izhuhaoran, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a bugfix aimed at improving the performance of the V1 engine, specifically concerning CUDA graph capture. The core change ensures that the default `cuda_graph_sizes` configuration automatically scales with the `max_num_seqs` setting, thereby optimizing batch processing and preventing scenarios where certain batch sizes might not be efficiently handled by CUDA graphs.
Highlights

- Default CUDA Graph Sizes: The default initialization of `cuda_graph_sizes` in the `SchedulerConfig` has been changed. Previously, it defaulted to a fixed `[512]`. Now, if not explicitly provided by the user, it dynamically defaults to `[max_num_seqs]`.
- Performance Improvement / Bugfix: This change addresses a potential performance bottleneck in the V1 engine. By aligning the default `cuda_graph_sizes` with `max_num_seqs`, it ensures that CUDA graph capture covers all relevant batch sizes, preventing performance degradation when `max_num_seqs` is larger than the previous fixed default.
Code Review

The pull request modifies the `cuda_graph_sizes` configuration in `vllm/config.py` to default to `max_num_seqs` for the V1 engine when no value is provided by the user. This change aims to address a potential performance issue where a fixed `cuda-graph-sizes` value might not cover all batch sizes when `max_num_seqs` is set to a larger value. The code changes include updating the default value of `cuda_graph_sizes` to `None` and setting it to `[self.max_num_seqs]` in the `__post_init__` method if it's `None`.
cc @ProExpertProg for visibility
@aarnphm @ProExpertProg Could you please take a final look at this PR to ensure it's ready for auto-merge?
I am realizing that this effectively doubled the cudagraph capture time (from 512 to 1024). CUDA graphs are less important at larger batch sizes, and it's not clear to me that the performance benefits warrant the longer capture time. @izhuhaoran, do you have any performance numbers that show otherwise?
Thank you for the feedback. Sorry, I haven't specifically benchmarked the performance impact for larger batches (e.g., >512), but I wanted to address a usability concern with the previous default behavior. In my view, the previous default value of `[512]` introduced two key issues:

This PR only aligns the default `cuda_graph_sizes` with `max_num_seqs` to make the behavior more intuitive and robust for typical use cases. While this may not be the perfect tradeoff either, I think a fixed default of 512 is more problematic. Do you have alternative suggestions for the default size configuration? For users sensitive to startup overhead or seeking optimal performance, I'd recommend explicitly setting `--cuda-graph-sizes` to match their specific workload requirements rather than relying on defaults.
First, I don't think this is a clear decision given chunked prefill is on by default. You also need to consider `max_num_batched_tokens`, right?
I also think the actual shape we capture in the cudagraph is not the number of sequences (limited by `max_num_seqs`) but the number of scheduled tokens (limited by `max_num_batched_tokens`).
Thank you for the feedback. I completely agree with your points; while working on this PR, I also recognized the limitations you mention. The current implementation aligns `cuda_graph_sizes` with `max_num_seqs` as a simple default, and as I mentioned earlier ("While this may not be the perfect tradeoff either"), I think a fixed `[512]` is arguably even worse. I fully support further discussion on better strategies for this setting. At the moment I don't have a more optimal solution to propose, but I'd be happy to explore alternatives (e.g., heuristic-based sizing, adaptive capture ranges) that balance performance and overhead more effectively.

I agree with you. Aligning to `max_num_seqs` and hardcoding `[512]` are both suboptimal choices; we need to discuss this further.

Yes, but one of them has doubled the default capture time, increasing startup time. Also, I don't think we can reach a better solution without comprehensive benchmarking with and without cudagraphs. I agree with @mgoin that we should revert this until we have those numbers and reach a decision. It would be great if you could help with some of the benchmarking! Do you have the bandwidth to do so?

I'd be glad to do some benchmarking. I will test the performance difference with and without CUDA graphs. While my bandwidth is limited, I'll make time to run these tests to help improve the current setting. After all, a reasonable capture-size setting is crucial for performance.

Can I try running for a few different model sizes? I assume cudagraphs are more important for smaller models.

I plan to test Qwen2.5-7B and Qwen2.5-72B, both TP 4, with different `max_num_seqs`. Any suggestions for this plan? I would also appreciate it if you could run some tests as well, as my bandwidth is limited.
Here are some benchmark results, cc @ProExpertProg. Any thoughts?

Qwen2.5-0.5B + TP1 + NVIDIA H20 (inlen=1, num_prompts=4096)
[benchmark table not preserved]

Qwen2.5-7B + TP4 + NVIDIA H20 (inlen=1, num_prompts=4096)
[benchmark table not preserved]
@mgoin @ProExpertProg: the concern about loading time, and that we may actually chunk based on `max_num_batched_tokens`, is valid. Do you feel `min(max_num_seqs * 2, 512)` is a compromise better than the status quo? More advanced requirements can go through config customization.

We are running into similar issues only when running with very tight memory and very low batch sizes: it still OOMs while trying to capture cudagraphs for large batch sizes we don't need. Usability should be more important than potential perf gaps from capturing slightly fewer batch sizes due to chunked prefill.
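The proposed compromise can be sketched as follows (a hypothetical helper for illustration, not the merged code):

```python
def default_cuda_graph_sizes(max_num_seqs: int) -> list[int]:
    """Compromise default: scale with max_num_seqs but cap at 512.

    Small-batch deployments capture fewer (smaller) graphs, bounding
    startup time and capture memory, while the 512 cap prevents capture
    cost from growing without limit for very large max_num_seqs.
    """
    return [min(max_num_seqs * 2, 512)]

assert default_cuda_graph_sizes(16) == [32]     # small deployments capture less
assert default_cuda_graph_sizes(256) == [512]   # exactly hits the cap
assert default_cuda_graph_sizes(1024) == [512]  # never exceeds 512
```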
@yeqcharlotte I think that is a reasonable compromise for now. I understand the advantage of reducing the cudagraphs captured for small batches.
I agree with using `min(max_num_seqs * 2, 512)` as the default. @mgoin, shall I update the commit and resubmit as a new PR? Also cc @ProExpertProg.
Just submit a new PR, or revert the revert, whichever is easier. Also cc: @houseroad @zou3519
Just submitted a new PR in #20628. cc @mgoin @yeqcharlotte
Purpose

I noticed that in the V1 engine, the default `cuda-graph-sizes` is a fixed number, 512. This may create a performance problem if a user sets `max_num_seqs` to a larger value, because not all batch sizes will be covered by the CUDA graphs.

To fix this, this PR makes the default `cuda_graph_sizes` align with `max_num_seqs` when the user doesn't provide it.