Conversation

@ProExpertProg (Collaborator) commented May 23, 2025

Enable fullgraph CUDAGraph capture for the FlashMLA decode case.

Hacks:

  • building the capture metadata
  • prefill batch bypasses compiled code and manually calls eager code

Tested with:

python examples/offline_inference/basic/generate.py --model deepseek-ai/DeepSeek-V2-Lite --trust-remote-code -O '{"full_cuda_graph": true}'
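
For reference, here is a rough Python-API equivalent of that command — just a sketch, assuming the compilation_config argument on LLM accepts the same dict as the -O CLI flag:

```python
# Sketch of the same test via the Python API; not taken from this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    trust_remote_code=True,
    # Enable full-graph CUDA graph capture (mirrors -O '{"full_cuda_graph": true}')
    compilation_config={"full_cuda_graph": True},
)

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```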


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@hypnopump

Could you give some details on speedup associated with this modification?


mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 23, 2025
@ProExpertProg
Collaborator Author

> Could you give some details on speedup associated with this modification?

I haven't profiled this in detail, but it's meant to enable the double-batch-overlap optimization (prototype in #18415).

@ProExpertProg ProExpertProg force-pushed the luka/mla-full-cudagraph branch 2 times, most recently from 976e852 to 40e7248 Compare May 30, 2025 20:21

mergify bot commented Jun 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 4, 2025
@izhuhaoran
Contributor

Hi, any further progress on this PR?

@ProExpertProg
Collaborator Author

> Hi, any further progress on this PR?

Almost ready for review!

@ProExpertProg ProExpertProg force-pushed the luka/mla-full-cudagraph branch from 5e3f7ab to 30562a2 Compare June 4, 2025 21:27
@ProExpertProg ProExpertProg force-pushed the luka/mla-full-cudagraph branch from f478ecd to ab519de Compare June 12, 2025 18:55
@ProExpertProg ProExpertProg marked this pull request as ready for review June 12, 2025 18:56
@mergify mergify bot removed the needs-rebase label Jun 12, 2025
        self._num_prefill_tokens = 0
        return self.build(0, m)

    def build(self, common_prefix_len: int,
              common_attn_metadata: CommonAttentionMetadata) -> M:
Collaborator

I think we can put common_prefix_len in CommonAttentionMetadata too and just default it to 0
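
Something along these lines, for illustration only (the field list is a placeholder, not the actual vLLM definition):

```python
from dataclasses import dataclass

import torch


@dataclass
class CommonAttentionMetadata:
    """Illustrative subset of fields only."""
    query_start_loc: torch.Tensor
    seq_lens: torch.Tensor
    num_actual_tokens: int
    # Proposed addition: shared across backends, defaulting to 0 so backends
    # that don't use cascade attention can simply ignore it.
    common_prefix_len: int = 0
```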

Collaborator Author

It's currently calculated per-backend though. I guess it should always be the same value?

Collaborator

Hmm, maybe not for this PR, but I think common_prefix_len should always be the same regardless of use_cascade_attention, and the backend can just choose to ignore it if use_cascade_attention is false; then it would belong in CommonAttentionMetadata.

))

attn_metadata_i = self.attn_metadata_builders[
    kv_cache_group_id].build_for_cudagraph_capture(
Collaborator

The _dummy_run is used for more than just CUDA graph capture; what if the backend doesn't support build_for_cudagraph_capture? We should still be able to run dummy runs.

@LucasWilkinson (Collaborator) commented Jun 12, 2025

Oh wait, I see build_for_cudagraph_capture is in the base class; I think this is still a bit confusing for backends that don't support full CUDA graphs.

@ProExpertProg (Collaborator, Author) commented Jun 12, 2025

  1. If a backend doesn't support it, this path is not triggered (it shouldn't be running with full CUDA graphs in the first place).
  2. This method just calls build by default anyway (see the sketch below).
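
For context, a rough sketch of the default being described here — illustrative names, not copied from the source:

```python
class AttentionMetadataBuilder:
    """Illustrative base class; real signatures may differ."""

    def build(self, common_prefix_len, common_attn_metadata):
        raise NotImplementedError

    def build_for_cudagraph_capture(self, common_attn_metadata):
        # Default: fall back to the regular build path with no shared prefix,
        # so backends that never opt into full CUDA graphs are unaffected.
        return self.build(0, common_attn_metadata)
```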

Collaborator

Should we maybe add a cudagraph_capturing flag to _dummy_run?

Collaborator Author

I think skip_attn is enough; what would change if we added that flag?

Collaborator

Basically do (attn_metadata_builder.build if not cudagraph_capturing else attn_metadata_builder.build_for_cudagraph_capture)(common_metadata),

just so we only use build_for_cudagraph_capture for CUDA graph capture (rough sketch below).
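
i.e. roughly this dispatch (a sketch with hypothetical names, not the actual _dummy_run code):

```python
def build_attn_metadata(builder, common_prefix_len, common_attn_metadata,
                        cudagraph_capturing: bool):
    """Use the capture-specific builder only during CUDA graph capture."""
    if cudagraph_capturing:
        return builder.build_for_cudagraph_capture(common_attn_metadata)
    return builder.build(common_prefix_len, common_attn_metadata)
```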

Collaborator

This way we can raise NotImplementedError in build_for_cudagraph_capture if the backend doesn't support it (so we don't accidentally give the impression that a backend supports full CUDA graphs when it actually doesn't); see the sketch below.
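
e.g. something like this, building on the base-class sketch above (the class name is made up):

```python
class NoFullGraphBackendMetadataBuilder(AttentionMetadataBuilder):
    """Hypothetical builder for a backend without full CUDA graph support."""

    def build_for_cudagraph_capture(self, common_attn_metadata):
        # Fail loudly instead of silently capturing an unsupported backend.
        raise NotImplementedError(
            "This attention backend does not support full CUDA graph capture.")
```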

Collaborator Author

Per offline discussion, agreed this interface is not ideal. But we only use _dummy_run with attention when capturing CUDA graphs, so I'll rename the flag; if regular attention in _dummy_run is needed in the future, a new flag can be added.

@LucasWilkinson (Collaborator) left a comment

Overall this is looking much better! Thanks for doing the refactor; left a couple of comments.

@LucasWilkinson (Collaborator) left a comment

LGTM thanks for the refactor!

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) June 13, 2025 17:32
@LucasWilkinson LucasWilkinson disabled auto-merge June 13, 2025 17:33
@LucasWilkinson LucasWilkinson enabled auto-merge (squash) June 13, 2025 17:36
@LucasWilkinson LucasWilkinson merged commit 3597b06 into vllm-project:main Jun 13, 2025
70 checks passed
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025