[Core] FlashInfer CUTLASS fused MoE backend (NVFP4) #20037
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @wenscarl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates a new, highly optimized backend for Mixture-of-Experts (MoE) computations into vLLM. By utilizing FlashInfer's CUTLASS kernels, it aims to significantly improve performance for NVFP4 quantized MoE models across diverse distributed inference configurations. The changes involve adding a new environment variable for activation, introducing dedicated modules for FlashInfer's kernel application and data handling, and adapting existing MoE infrastructure to support its specific requirements for quantization and parallelism.
Highlights
- FlashInfer Integration: Introduced a new backend for Mixture-of-Experts (MoE) operations leveraging FlashInfer's CUTLASS kernels, specifically optimized for NVFP4 quantization.
- Configurability: Added the VLLM_USE_FLASHINFER_MOE environment variable, allowing users to enable or disable the new FlashInfer MoE backend at runtime (see the sketch after this list).
- Parallelism Support: Enhanced the MoE layer to seamlessly integrate with various parallelism strategies (Data Parallelism, Expert Parallelism, Tensor Parallelism) when using the new FlashInfer backend.
- NVFP4 Quantization Handling: Implemented specific data preparation and finalization steps for NVFP4 quantized inputs and weights, including handling of scaling factors and adapting to FlashInfer's required weight layouts.
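To make the gating concrete, here is a minimal sketch of how a flag like this could be consumed. It assumes the flag is read straight from the environment, and the helper and backend-name strings are purely illustrative, not vLLM's actual identifiers:

```python
import os

# Minimal sketch only: vLLM reads the flag through its own env-var machinery;
# os.environ is used here purely for brevity.
def flashinfer_moe_enabled() -> bool:
    return os.environ.get("VLLM_USE_FLASHINFER_MOE", "0") == "1"

def select_moe_backend(weights_are_nvfp4: bool) -> str:
    # The FlashInfer CUTLASS path only applies to NVFP4-quantized experts;
    # other quantization schemes keep using the existing fused MoE kernels.
    # Backend names below are illustrative placeholders.
    if flashinfer_moe_enabled() and weights_are_nvfp4:
        return "flashinfer-cutlass-nvfp4"
    return "default-fused-moe"
```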
Code Review
This pull request integrates the FlashInfer CUTLASS MoE kernel for NVFP4, which is a valuable performance enhancement for Mixture-of-Experts models. The changes are well-structured, with the new logic mostly encapsulated in new files.
My review has identified a critical issue regarding an incorrect device capability check that will lead to a runtime error. Additionally, there's a high-severity concern about the use of unsafe tensor view operations, which poses a risk to future compatibility and maintainability. I've also noted a minor issue of dead code that should be cleaned up.
I recommend addressing these points to ensure the stability and long-term health of the codebase before merging.
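For context on the capability-check concern, the following is a minimal sketch of the kind of guard an NVFP4 backend typically needs. The helper name and the compute-capability threshold are assumptions, not the code under review:

```python
import torch

def _nvfp4_moe_supported() -> bool:
    # Assumption: NVFP4 CUTLASS kernels target Blackwell-class GPUs
    # (compute capability 10.x). The exact gate used in this PR may differ.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (10, 0)
```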
This pull request has merge conflicts that must be resolved before it can be merged.
```python
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    if qtype == torch.float8_e4m3fn:
        return _fp8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == torch.int8:
        return _int8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == torch.uint8:  # nvfp4
```
Can we make qtype be torch.dtype or scalar_types and then use scalar_types.float4_e2m1f?
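A hedged sketch of what that suggestion could look like: the function name, parameter list, and the _nvfp4_quantize helper are assumptions layered on the snippet above, whose _fp8_quantize/_int8_quantize helpers are reused as-is (they are not redefined here):

```python
from typing import Optional, Union

import torch

from vllm.scalar_type import ScalarType, scalar_types

def moe_kernel_quantize_input(  # name assumed for illustration
    A: torch.Tensor,
    A_scale: Optional[torch.Tensor],
    qtype: Union[torch.dtype, ScalarType],
    per_channel_quant: bool,
    block_shape: Optional[list[int]] = None,
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    # Accepting either a torch.dtype or a vLLM ScalarType lets NVFP4 use
    # scalar_types.float4_e2m1f instead of masquerading as torch.uint8.
    if qtype == torch.float8_e4m3fn:
        return _fp8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == torch.int8:
        return _int8_quantize(A, A_scale, per_channel_quant, block_shape)
    elif qtype == scalar_types.float4_e2m1f:
        # _nvfp4_quantize is a hypothetical helper standing in for the
        # NVFP4 quantization path.
        return _nvfp4_quantize(A, A_scale)
    raise ValueError(f"Unsupported quantization type: {qtype}")
```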
Force-pushed 7fcb48e to e4e78be
Force-pushed 1a219c1 to e589b4a
Force-pushed e589b4a to 7cdb800
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: shuw <shuw@nvidia.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
```python
def maybe_swap_experts_impl(
    self,
    moe_parallel_config: FusedMoEParallelConfig,
):
```
Would be good to leave a docstring on the purpose of this function and when to implement it
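For example, a hedged sketch of such a docstring on the signature shown above; the described behavior is inferred from the method's name and argument and should be checked against the implementation:

```python
def maybe_swap_experts_impl(
    self,
    moe_parallel_config: FusedMoEParallelConfig,
):
    """Optionally swap in a specialized fused-experts implementation.

    Sketch of the requested docstring (behavior inferred, verify against
    the code): quantization methods that ship their own expert kernels,
    e.g. the FlashInfer CUTLASS NVFP4 path, can override this hook to
    replace the default experts implementation once the MoE parallel
    configuration (DP/TP/EP sizes) is known. Backends without a
    specialized kernel leave it as a no-op.
    """
```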
LGTM! Great work iterating on this. We will address the needed ModularKernel refactors for TP, Llama 4 support for this kernel, and CT NVFP4 integration in follow-up PRs. Thanks for your efforts.
I've validated that the existing CT NVFP4 pathways for Qwen3 and Llama 4 are working.
lm_eval --model vllm --model_args pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2,enforce_eager=True --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%|██████████| 1319/1319 [00:32<00:00, 40.47it/s, est. speed input: 40182.41 toks/s, output: 5217.28 toks/s]
Running generate_until requests: 100%|██████████| 1319/1319 [00:32<00:00, 40.31it/s]
2025-07-17:13:07:34,275 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8787|± |0.0090|
| | |strict-match | 5|exact_match|↑ |0.8711|± |0.0092|
lm_eval --model vllm --model_args pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Processed prompts: 100%|██████████| 1319/1319 [00:25<00:00, 51.49it/s, est. speed input: 51125.17 toks/s, output: 6532.40 toks/s]
Running generate_until requests: 100%|██████████| 1319/1319 [00:25<00:00, 51.18it/s]
2025-07-17:13:18:04,157 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=nm-testing/Qwen3-30B-A3B-NVFP4,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8832|± |0.0088|
| | |strict-match | 5|exact_match|↑ |0.8810|± |0.0089|
It looks like we are initializing CUDA too early now, breaking many tests. I'm looking into it.
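A hedged sketch of the usual remedy for this class of failure, under the assumption that some probe (a flashinfer import, a device query, or similar) currently runs at module import time; the function name is illustrative, not the PR's fix:

```python
import functools

# Keep anything that can initialize CUDA or load heavyweight extensions out of
# module import. Probe lazily, on first use, and cache the answer so CPU-only
# test runs never touch the GPU stack.
@functools.lru_cache(maxsize=None)
def has_flashinfer_cutlass_moe() -> bool:
    try:
        import flashinfer  # noqa: F401
    except ImportError:
        return False
    return True
```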
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Himanshu Jaju <hj@mistral.ai>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: x22x22 <wadeking@qq.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
This PR covers:
- The FlashInfer CUTLASS fused MoE backend for NVFP4, enabled with VLLM_USE_FLASHINFER_MOE=1.
- The cutlass_moe_fp4 backend supports DP, TP, or DP + TP. EP is not supported yet.

Example usage
Use the Data Parallel example script with the model nvidia/DeepSeek-R1-FP4, quantization modelopt_fp4, and enable_expert_parallel=True.

DP + EP:
python data_parallel.py --dp-size=4 --tp-size=1

TP + EP, tested on B100x4.
cc @kushanam @pavanimajety @trevor-m @kaixih
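For completeness, a hedged offline-inference sketch using the model and quantization named in the description; the tensor-parallel size and sampling settings are illustrative, and this is not the exact TP + EP command from the PR:

```python
import os

# Set the flag before importing vLLM so it is visible at engine start-up
# (sketch only).
os.environ["VLLM_USE_FLASHINFER_MOE"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/DeepSeek-R1-FP4",
    quantization="modelopt_fp4",
    tensor_parallel_size=4,  # illustrative; the PR reports TP + EP on 4x B100
)
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```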