
Conversation

wenscarl
Contributor

@wenscarl wenscarl commented Jun 24, 2025

This PR covers:

  1. Integrates the FlashInfer NVFP4 CUTLASS MoE kernel ("Add CUTLASS fused moe kernels from TensorRT-LLM", flashinfer-ai/flashinfer#1113), which can be enabled with the environment variable VLLM_USE_FLASHINFER_MOE=1 (see the sketch after this list). It supports:
  • DP + EP
  • TP + EP
  • DP + TP + EP
  2. Refactors cutlass_moe_fp4 into the modular kernel structure. The cutlass_moe_fp4 backend supports DP, TP, or DP + TP; EP is not supported yet.
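
As a minimal sketch of the gate (assuming the flag is exposed through vllm.envs as VLLM_USE_FLASHINFER_MOE, following vLLM's usual env-var pattern; the helper below is hypothetical and not the PR's actual code):

import vllm.envs as envs

def use_flashinfer_cutlass_moe(is_nvfp4: bool, has_flashinfer: bool) -> bool:
    # Hypothetical helper: take the FlashInfer CUTLASS MoE path only when the
    # env flag is set, the checkpoint is NVFP4-quantized, and FlashInfer is
    # importable on this machine.
    return bool(envs.VLLM_USE_FLASHINFER_MOE) and is_nvfp4 and has_flashinfer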

Example usage

Use the data parallel example script (data_parallel.py), specify the model as nvidia/DeepSeek-R1-FP4, set quantization to modelopt_fp4, and set enable_expert_parallel=True.

DP + EP

python data_parallel.py --dp-size=4 --tp-size=1
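
For illustration, a full DP + EP invocation with the FlashInfer path enabled might look like this (the --model, --quantization, and --enable-expert-parallel flags on data_parallel.py are assumptions; check the script's argument parser before copying):

VLLM_USE_FLASHINFER_MOE=1 python data_parallel.py --dp-size=4 --tp-size=1 \
    --model=nvidia/DeepSeek-R1-FP4 --quantization=modelopt_fp4 --enable-expert-parallel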

TP + EP, tested on 4x B100

VLLM_USE_FLASHINFER_MOE=1 python workspace/vllm/vllm/benchmarks/benchmark_throughput.py --model=nvidia/DeepSeek-R1-FP4 --output-len=1024 --tensor-parallel-size=4 --input-len=1024 --max-model-len=2048 --trust-remote-code --load-format=dummy --gpu_memory_utilization=0.97 --max-num-seqs=128 --num-prompts=128 --enable_expert_parallel --quantization=modelopt_fp4 --enforce-eager

Eager mode, V1 engine:
Without FlashInfer CUTLASS MoE:
Throughput: 0.34 requests/s, 701.67 total tokens/s, 350.99 output tokens/s

With FlashInfer CUTLASS MoE:
Throughput: 0.65 requests/s, 1333.20 total tokens/s, 667.08 output tokens/s (1.9x; uses allgatherv and reduce-scatter from pynccl)
pynccl is roughly 1.13x faster than the native all2all path.

Graph mode, V0 engine only (VLLM_USE_V1=0):
Without FlashInfer CUTLASS MoE:
Throughput: 0.56 requests/s, 1138.12 total tokens/s, 569.06 output tokens/s

With FlashInfer CUTLASS MoE:
Throughput: 1.18 requests/s, 2411.15 total tokens/s, 1208.55 output tokens/s (2.1x)

Graph mode, V1 engine:
Without FlashInfer CUTLASS MoE:
Throughput: 0.40 requests/s, 825.88 total tokens/s, 413.37 output tokens/s

With FlashInfer CUTLASS MoE:
Throughput: 1.24 requests/s, 2542.87 total tokens/s, 1271.43 output tokens/s (~3.1x)

cc @kushanam @pavanimajety @trevor-m @kaixih


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @wenscarl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new, highly optimized backend for Mixture-of-Experts (MoE) computations into vLLM. By utilizing FlashInfer's CUTLASS kernels, it aims to significantly improve performance for NVFP4 quantized MoE models across diverse distributed inference configurations. The changes involve adding a new environment variable for activation, introducing dedicated modules for FlashInfer's kernel application and data handling, and adapting existing MoE infrastructure to support its specific requirements for quantization and parallelism.

Highlights

  • FlashInfer Integration: Introduced a new backend for Mixture-of-Experts (MoE) operations leveraging FlashInfer's CUTLASS kernels, specifically optimized for NVFP4 quantization.
  • Configurability: Added the VLLM_USE_FLASHINFER_MOE environment variable, allowing users to enable or disable this new FlashInfer MoE backend at runtime.
  • Parallelism Support: Enhanced the MoE layer to seamlessly integrate with various parallelism strategies (Data Parallelism, Expert Parallelism, Tensor Parallelism) when using the new FlashInfer backend.
  • NVFP4 Quantization Handling: Implemented specific data preparation and finalization steps for NVFP4 quantized inputs and weights, including handling of scaling factors and adapting to FlashInfer's required weight layouts.
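
To illustrate the shape of that prepare/finalize split, here is a structural sketch only (class and method names are hypothetical and do not reflect vLLM's actual modular-kernel interfaces):

import torch

class NvFp4MoEPrepareFinalizeSketch:
    """Hypothetical illustration of the prepare/finalize responsibilities."""

    def prepare(self, hidden_states: torch.Tensor, input_global_scale: torch.Tensor):
        # Quantize activations to NVFP4 (two FP4 values packed per uint8) and
        # return the packed tensor together with its blockwise scaling factors,
        # laid out the way the FlashInfer kernel expects.
        raise NotImplementedError

    def finalize(self, expert_output: torch.Tensor,
                 topk_weights: torch.Tensor) -> torch.Tensor:
        # Scatter per-expert results back to token order and apply the top-k
        # routing weights, reducing across EP ranks if expert parallelism is on.
        raise NotImplementedError
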

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request integrates the Flashinfer CUTLASS MoE kernel for NVFP4, which is a valuable performance enhancement for Mixture of Experts models. The changes are well-structured, with the new logic mostly encapsulated in new files.

My review has identified a critical issue regarding an incorrect device capability check that will lead to a runtime error. Additionally, there's a high-severity concern about the use of unsafe tensor view operations, which poses a risk to future compatibility and maintainability. I've also noted a minor issue of dead code that should be cleaned up.

I recommend addressing these points to ensure the stability and long-term health of the codebase before merging.


mergify bot commented Jun 25, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    if qtype == torch.float8_e4m3fn:
        return _fp8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == torch.int8:
        return _int8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == torch.uint8:  # nvfp4
Contributor


Can we make qtype accept either a torch.dtype or a ScalarType, and then use scalar_types.float4_e2m1f here?
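
For illustration, a sketch of that suggestion (the enclosing function name and parameter list are assumptions inferred from the quoted fragment; ScalarType and scalar_types.float4_e2m1f are assumed to come from vllm.scalar_type; _fp8_quantize and _int8_quantize are the helpers from the quoted code, and _nvfp4_quantize is a hypothetical helper standing in for the nvfp4 branch):

from typing import Optional, Union

import torch

from vllm.scalar_type import ScalarType, scalar_types

def moe_kernel_quantize_input(
    A: torch.Tensor,
    A_scale: Optional[torch.Tensor],
    qtype: Union[torch.dtype, ScalarType, None],
    per_channel_quant: bool,
    block_shape: Optional[list[int]] = None,
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    # Dispatch on either a torch dtype or a vLLM ScalarType, so the nvfp4 path
    # no longer has to be spelled as torch.uint8.
    if qtype == torch.float8_e4m3fn:
        return _fp8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == torch.int8:
        return _int8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == scalar_types.float4_e2m1f:
        return _nvfp4_quantize(A, A_scale)  # hypothetical nvfp4 helper
    else:
        return A, A_scale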

@mgoin mgoin added this to the v0.9.2 milestone Jul 1, 2025
@wenscarl wenscarl force-pushed the flashinfer_fused_moe branch from 7fcb48e to e4e78be on July 3, 2025 15:34
@mergify mergify bot added the performance (Performance-related issues) label and removed the needs-rebase label Jul 3, 2025
@wenscarl wenscarl force-pushed the flashinfer_fused_moe branch 2 times, most recently from 1a219c1 to e589b4a on July 3, 2025 15:46
@wenscarl wenscarl force-pushed the flashinfer_fused_moe branch from e589b4a to 7cdb800 on July 3, 2025 15:48

mergify bot commented Jul 5, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 5, 2025
Signed-off-by: shuw <shuw@nvidia.com>

mergify bot commented Jul 17, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 17, 2025
@mergify mergify bot removed the needs-rebase label Jul 17, 2025
Signed-off-by: shuw <shuw@nvidia.com>
@wenscarl wenscarl requested a review from mgoin July 17, 2025 16:52
Signed-off-by: mgoin <mgoin64@gmail.com>
Comment on lines +214 to +217
def maybe_swap_experts_impl(
    self,
    moe_parallel_config: FusedMoEParallelConfig,
):
Member


Would be good to leave a docstring on the purpose of this function and when to implement it
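
For example, a possible docstring (my own wording, inferred from the PR description rather than taken from the merged code) could read:

def maybe_swap_experts_impl(
    self,
    moe_parallel_config: FusedMoEParallelConfig,
):
    """Optionally return a replacement fused-experts implementation.

    Quantization methods that ship a specialized experts kernel (e.g. the
    FlashInfer CUTLASS NVFP4 MoE path) can override this hook to swap in that
    kernel when the given DP/TP/EP parallel configuration supports it;
    otherwise the default experts implementation is kept.
    """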

Member

@mgoin mgoin left a comment


LGTM! Great work iterating on this. We will address the needed ModularKernel refactors for TP, Llama 4 support for this kernel, and CT NVFP4 integration in followup PRs. Thanks for your efforts.

I've validated the existing pathways for CT NVFP4 for Qwen3 and Llama 4 are working.

lm_eval --model vllm --model_args pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2,enforce_eager=True --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%| 1319/1319 [00:32<00:00, 40.47it/s, est. speed input: 40182.41 toks/s, output: 5217.28 toks/s]
Running generate_until requests: 100%| 1319/1319 [00:32<00:00, 40.31it/s]
2025-07-17:13:07:34,275 INFO     [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8787|±  |0.0090|
|     |       |strict-match    |     5|exact_match|↑  |0.8711|±  |0.0092|


lm_eval --model vllm --model_args pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%| 1319/1319 [00:25<00:00, 51.49it/s, est. speed input: 51125.17 toks/s, output: 6532.40 toks/s]
Running generate_until requests: 100%| 1319/1319 [00:25<00:00, 51.18it/s]
2025-07-17:13:18:04,157 INFO     [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8832|±  |0.0088|
|     |       |strict-match    |     5|exact_match|↑  |0.8810|±  |0.0089|

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 17, 2025
@mgoin
Member

mgoin commented Jul 17, 2025

It looks like we are initializing CUDA too early now, breaking many tests. I'm looking into it

mgoin added 2 commits July 17, 2025 17:12
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
@simon-mo simon-mo merged commit c7d8724 into vllm-project:main Jul 18, 2025
71 of 73 checks passed
WorldExplored pushed a commit to nadathurv/vllm that referenced this pull request Jul 19, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
WorldExplored pushed a commit to nadathurv/vllm that referenced this pull request Jul 19, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com>
hj-mistral pushed a commit to hj-mistral/vllm that referenced this pull request Jul 19, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Himanshu Jaju <hj@mistral.ai>
LyrisZhong pushed a commit to LyrisZhong/vllm that referenced this pull request Jul 23, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: x22x22 <wadeking@qq.com>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
taneem-ibrahim pushed a commit to taneem-ibrahim/vllm that referenced this pull request Aug 14, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 27, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Labels
ci/build, deepseek (Related to DeepSeek models), performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed), v1

5 participants