[Core] FlashInfer CUTLASS fused MoE backend (NVFP4) #20037
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @wenscarl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates a new, highly optimized backend for Mixture-of-Experts (MoE) computations into vLLM. By utilizing FlashInfer's CUTLASS kernels, it aims to significantly improve performance for NVFP4 quantized MoE models across diverse distributed inference configurations. The changes involve adding a new environment variable for activation, introducing dedicated modules for FlashInfer's kernel application and data handling, and adapting existing MoE infrastructure to support its specific requirements for quantization and parallelism.
Highlights
- FlashInfer Integration: Introduced a new backend for Mixture-of-Experts (MoE) operations leveraging FlashInfer's CUTLASS kernels, specifically optimized for NVFP4 quantization.
- Configurability: Added the VLLM_USE_FLASHINFER_MOE environment variable, allowing users to enable or disable the new FlashInfer MoE backend at runtime (see the sketch after this list).
- Parallelism Support: Enhanced the MoE layer to seamlessly integrate with various parallelism strategies (Data Parallelism, Expert Parallelism, Tensor Parallelism) when using the new FlashInfer backend.
- NVFP4 Quantization Handling: Implemented specific data preparation and finalization steps for NVFP4 quantized inputs and weights, including handling of scaling factors and adapting to FlashInfer's required weight layouts.
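To make the gating concrete, here is a minimal sketch of how a flag like this could be consumed. It assumes the flag is read straight from the environment, and the helper and backend-name strings are purely illustrative, not vLLM's actual identifiers:

```python
import os

# Minimal sketch only: vLLM reads the flag through its own env-var machinery;
# os.environ is used here purely for brevity.
def flashinfer_moe_enabled() -> bool:
    return os.environ.get("VLLM_USE_FLASHINFER_MOE", "0") == "1"

def select_moe_backend(weights_are_nvfp4: bool) -> str:
    # The FlashInfer CUTLASS path only applies to NVFP4-quantized experts;
    # other quantization schemes keep using the existing fused MoE kernels.
    # Backend names below are illustrative placeholders.
    if flashinfer_moe_enabled() and weights_are_nvfp4:
        return "flashinfer-cutlass-nvfp4"
    return "default-fused-moe"
```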
Code Review
This pull request integrates the FlashInfer CUTLASS MoE kernel for NVFP4, which is a valuable performance enhancement for Mixture-of-Experts models. The changes are well-structured, with the new logic mostly encapsulated in new files.
My review has identified a critical issue regarding an incorrect device capability check that will lead to a runtime error. Additionally, there's a high-severity concern about the use of unsafe tensor view operations, which poses a risk to future compatibility and maintainability. I've also noted a minor issue of dead code that should be cleaned up.
I recommend addressing these points to ensure the stability and long-term health of the codebase before merging.
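For context on the capability-check concern, the following is a minimal sketch of the kind of guard an NVFP4 backend typically needs. The helper name and the compute-capability threshold are assumptions, not the code under review:

```python
import torch

def _nvfp4_moe_supported() -> bool:
    # Assumption: NVFP4 CUTLASS kernels target Blackwell-class GPUs
    # (compute capability 10.x). The exact gate used in this PR may differ.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (10, 0)
```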
This pull request has merge conflicts that must be resolved before it can be merged.
```python
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    if qtype == torch.float8_e4m3fn:
        return _fp8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == torch.int8:
        return _int8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == torch.uint8:  # nvfp4
```
Can we make qtype be torch.dtype or scalar_types and then use scalar_types.float4_e2m1f?
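A hedged sketch of what that suggestion could look like: the function name, parameter list, and the _nvfp4_quantize helper are assumptions layered on the snippet above, whose _fp8_quantize/_int8_quantize helpers are reused as-is (they are not redefined here):

```python
from typing import Optional, Union

import torch

from vllm.scalar_type import ScalarType, scalar_types

def moe_kernel_quantize_input(  # name assumed for illustration
    A: torch.Tensor,
    A_scale: Optional[torch.Tensor],
    qtype: Union[torch.dtype, ScalarType],
    per_channel_quant: bool,
    block_shape: Optional[list[int]] = None,
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    # Accepting either a torch.dtype or a vLLM ScalarType lets NVFP4 use
    # scalar_types.float4_e2m1f instead of masquerading as torch.uint8.
    if qtype == torch.float8_e4m3fn:
        return _fp8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == torch.int8:
        return _int8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == scalar_types.float4_e2m1f:
        # _nvfp4_quantize is a hypothetical helper standing in for the
        # NVFP4 quantization path.
        return _nvfp4_quantize(A, A_scale)
    raise ValueError(f"Unsupported quantization type: {qtype}")
```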
Force-pushed 7fcb48e to e4e78be
Force-pushed 1a219c1 to e589b4a
Force-pushed e589b4a to 7cdb800
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: shuw <shuw@nvidia.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
```python
def maybe_swap_experts_impl(
    self,
    moe_parallel_config: FusedMoEParallelConfig,
):
```
Would be good to leave a docstring on the purpose of this function and when to implement it
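For example, a hedged sketch of such a docstring on the signature shown above; the described behavior is inferred from the method's name and argument and should be checked against the implementation:

```python
def maybe_swap_experts_impl(
    self,
    moe_parallel_config: FusedMoEParallelConfig,
):
    """Optionally swap in a specialized fused-experts implementation.

    Sketch of the requested docstring (behavior inferred, verify against
    the code): quantization methods that ship their own expert kernels,
    e.g. the FlashInfer CUTLASS NVFP4 path, can override this hook to
    replace the default experts implementation once the MoE parallel
    configuration (DP/TP/EP sizes) is known. Backends without a
    specialized kernel leave it as a no-op.
    """
```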
LGTM! Great work iterating on this. We will address the needed ModularKernel refactors for TP, Llama 4 support for this kernel, and CT NVFP4 integration in follow-up PRs. Thanks for your efforts.
I've validated that the existing CT NVFP4 pathways for Qwen3 and Llama 4 are working.
lm_eval --model vllm --model_args pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2,enforce_eager=True --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%|██████████| 1319/1319 [00:32<00:00, 40.47it/s, est. speed input: 40182.41 toks/s, output: 5217.28 toks/s]
Running generate_until requests: 100%|██████████| 1319/1319 [00:32<00:00, 40.31it/s]
2025-07-17:13:07:34,275 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8787|± |0.0090|
| | |strict-match | 5|exact_match|↑ |0.8711|± |0.0092|
lm_eval --model vllm --model_args pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%|██████████| 1319/1319 [00:25<00:00, 51.49it/s, est. speed input: 51125.17 toks/s, output: 6532.40 toks/s]
Running generate_until requests: 100%|██████████| 1319/1319 [00:25<00:00, 51.18it/s]
2025-07-17:13:18:04,157 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8832|± |0.0088|
| | |strict-match | 5|exact_match|↑ |0.8810|± |0.0089|
It looks like we are initializing CUDA too early now, breaking many tests. I'm looking into it.
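A hedged sketch of the usual remedy for this class of failure, under the assumption that some probe (a flashinfer import, a device query, or similar) currently runs at module import time; the function name is illustrative, not the PR's fix:

```python
import functools

# Keep anything that can initialize CUDA or load heavyweight extensions out of
# module import. Probe lazily, on first use, and cache the answer so CPU-only
# test runs never touch the GPU stack.
@functools.lru_cache(maxsize=None)
def has_flashinfer_cutlass_moe() -> bool:
    try:
        import flashinfer  # noqa: F401
    except ImportError:
        return False
    return True
```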
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Himanshu Jaju <hj@mistral.ai>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: x22x22 <wadeking@qq.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
This PR covers:
- The FlashInfer CUTLASS fused MoE backend for NVFP4, enabled with VLLM_USE_FLASHINFER_MOE=1.
- The cutlass_moe_fp4 backend supports DP, TP, or DP + TP. EP is not supported yet.

Example usage
Use the Data Parallel example script with the model nvidia/DeepSeek-R1-FP4, quantization modelopt_fp4, and enable_expert_parallel=True.

DP + EP:
python data_parallel.py --dp-size=4 --tp-size=1

TP + EP, tested on B100x4.
cc @kushanam @pavanimajety @trevor-m @kaixih
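For completeness, a hedged offline-inference sketch using the model and quantization named in the description; the tensor-parallel size and sampling settings are illustrative, and this is not the exact TP + EP command from the PR:

```python
import os

# Set the flag before importing vLLM so it is visible at engine start-up
# (sketch only).
os.environ["VLLM_USE_FLASHINFER_MOE"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/DeepSeek-R1-FP4",
    quantization="modelopt_fp4",
    tensor_parallel_size=4,  # illustrative; the PR reports TP + EP on 4x B100
)
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```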