Conversation

@ProExpertProg (Collaborator) commented May 23, 2025

Enable fullgraph CUDAGraph capture for the FlashMLA decode case.

Hacks:

  • building the capture metadata
  • prefill batch bypasses compiled code and manually calls eager code

Tested with:

python examples/offline_inference/basic/generate.py --model deepseek-ai/DeepSeek-V2-Lite --trust-remote-code -O '{"full_cuda_graph": true}'
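
For reference, here is a rough Python-API equivalent of that command — just a sketch, assuming the compilation_config argument on LLM accepts the same dict as the -O CLI flag:

```python
# Sketch of the same test via the Python API; not taken from this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    trust_remote_code=True,
    # Enable full-graph CUDA graph capture (mirrors -O '{"full_cuda_graph": true}')
    compilation_config={"full_cuda_graph": True},
)

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```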


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@hypnopump

Could you give some details on speedup associated with this modification?


mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 23, 2025
@ProExpertProg
Collaborator Author

> Could you give some details on speedup associated with this modification?

I haven't profiled this in detail, but it's meant to enable the double-batch-overlap optimization (prototype in #18415).

@ProExpertProg ProExpertProg force-pushed the luka/mla-full-cudagraph branch 2 times, most recently from 976e852 to 40e7248 Compare May 30, 2025 20:21

mergify bot commented Jun 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 4, 2025
@izhuhaoran
Contributor

Hi, any further progress on this PR?

@ProExpertProg
Collaborator Author

> Hi, any further progress on this PR?

Almost ready for review!

@ProExpertProg ProExpertProg force-pushed the luka/mla-full-cudagraph branch from 5e3f7ab to 30562a2 Compare June 4, 2025 21:27
@ProExpertProg ProExpertProg force-pushed the luka/mla-full-cudagraph branch from f478ecd to ab519de Compare June 12, 2025 18:55
@ProExpertProg ProExpertProg marked this pull request as ready for review June 12, 2025 18:56
@mergify mergify bot removed the needs-rebase label Jun 12, 2025
        self._num_prefill_tokens = 0
        return self.build(0, m)

    def build(self, common_prefix_len: int,
              common_attn_metadata: CommonAttentionMetadata) -> M:
Collaborator

I think we can put common_prefix_len in CommonAttentionMetadata too and just default it to 0
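
Something along these lines, for illustration only (the field list is a placeholder, not the actual vLLM definition):

```python
from dataclasses import dataclass

import torch


@dataclass
class CommonAttentionMetadata:
    """Illustrative subset of fields only."""
    query_start_loc: torch.Tensor
    seq_lens: torch.Tensor
    num_actual_tokens: int
    # Proposed addition: shared across backends, defaulting to 0 so backends
    # that don't use cascade attention can simply ignore it.
    common_prefix_len: int = 0
```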

Collaborator Author

It's currently calculated per-backend though. I guess it should always be the same value?

Collaborator

Hmm, maybe not for this PR, but I think common_prefix_len should always be the same regardless of use_cascade_attention, and the backend can just choose to ignore it if use_cascade_attention is false; then it would belong in CommonAttentionMetadata.

))

attn_metadata_i = self.attn_metadata_builders[
    kv_cache_group_id].build_for_cudagraph_capture(
Collaborator

The _dummy_run is used for more than just CUDA graph capture; what if the backend doesn't support build_for_cudagraph_capture? We should still be able to run dummy runs.

@LucasWilkinson (Collaborator) commented Jun 12, 2025

Oh wait, I see build_for_cudagraph_capture is in the base class; I think this is still a bit confusing for backends that don't support full CUDA graphs.

@ProExpertProg (Collaborator, Author) commented Jun 12, 2025

  1. If a backend doesn't support it, this path is not triggered (it shouldn't be running with full CUDA graphs in the first place).
  2. This method just calls build by default anyway (see the sketch below).
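
For context, a rough sketch of the default being described here — illustrative names, not copied from the source:

```python
class AttentionMetadataBuilder:
    """Illustrative base class; real signatures may differ."""

    def build(self, common_prefix_len, common_attn_metadata):
        raise NotImplementedError

    def build_for_cudagraph_capture(self, common_attn_metadata):
        # Default: fall back to the regular build path with no shared prefix,
        # so backends that never opt into full CUDA graphs are unaffected.
        return self.build(0, common_attn_metadata)
```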

Collaborator

Should we maybe add a cudagraph_capturing flag to _dummy_run?

Collaborator Author

I think skip_attn is enough; what would change if we added that flag?

Collaborator

Basically do (attn_metadata_builder.build if not cudagraph_capturing else attn_metadata_builder.build_for_cudagraph_capture)(common_metadata),

just so we only use build_for_cudagraph_capture for CUDA graph capture (rough sketch below).
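
i.e. roughly this dispatch (a sketch with hypothetical names, not the actual _dummy_run code):

```python
def build_attn_metadata(builder, common_prefix_len, common_attn_metadata,
                        cudagraph_capturing: bool):
    """Use the capture-specific builder only during CUDA graph capture."""
    if cudagraph_capturing:
        return builder.build_for_cudagraph_capture(common_attn_metadata)
    return builder.build(common_prefix_len, common_attn_metadata)
```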

Collaborator

This way we can raise NotImplementedError in build_for_cudagraph_capture if the backend doesn't support it (so we don't accidentally give the impression that a backend supports full CUDA graphs when it actually doesn't); see the sketch below.
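
e.g. something like this, building on the base-class sketch above (the class name is made up):

```python
class NoFullGraphBackendMetadataBuilder(AttentionMetadataBuilder):
    """Hypothetical builder for a backend without full CUDA graph support."""

    def build_for_cudagraph_capture(self, common_attn_metadata):
        # Fail loudly instead of silently capturing an unsupported backend.
        raise NotImplementedError(
            "This attention backend does not support full CUDA graph capture.")
```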

Collaborator Author

Per offline discussion, agreed this interface is not ideal. But we only use _dummy_run with attention when capturing CUDA graphs, so I'll rename the flag; if regular attention in _dummy_run is needed in the future, a new flag can be added.

@LucasWilkinson (Collaborator) left a comment

Overall this is looking much better! Thanks for doing the refactor; left a couple of comments.

@LucasWilkinson (Collaborator) left a comment

LGTM thanks for the refactor!

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) June 13, 2025 17:32
@LucasWilkinson LucasWilkinson disabled auto-merge June 13, 2025 17:33
@LucasWilkinson LucasWilkinson enabled auto-merge (squash) June 13, 2025 17:36
@LucasWilkinson LucasWilkinson merged commit 3597b06 into vllm-project:main Jun 13, 2025
70 checks passed
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025