[NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE #8450
Conversation
Summary of Changes
Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request extends the system's Mixture-of-Experts (MoE) capabilities by integrating the Flashinfer block-scale FP8 backend for Tensor Parallel (TP) configurations. It introduces a new class and modifies the MoE dispatch and FP8 quantization logic to leverage this optimized kernel, decoupling its enablement from Expert Parallel (EP) MoE.
Highlights
- Flashinfer MoE Backend for TP MoE: This pull request enables the Flashinfer block-scale FP8 backend for Tensor Parallel (TP) Mixture-of-Experts (MoE) configurations, building upon previous work that enabled it for Expert Parallel (EP) MoE.
- New MoE Implementation Class: A new class, `FlashInferFusedMoE`, has been introduced to encapsulate the logic for Flashinfer-specific fused MoE operations; it is now conditionally selected by the system's MoE implementation dispatch logic.
- FP8 Quantization Integration: The FP8 quantization module has been updated with a dedicated `apply` method to prepare inputs (including per-token group quantization and weight reordering) and invoke Flashinfer's `trtllm_fp8_block_scale_moe` kernel for optimized computation.
- Decoupled MoE Enablement: The requirement for `enable_ep_moe` to be true when `enable_flashinfer_trtllm_moe` is active has been removed, allowing the Flashinfer TRTLLM MoE backend to be used independently in TP MoE scenarios.
Code Review
This pull request enables the Flashinfer MoE blockscale fp8 backend for Tensor Parallelism (TP) MoE, extending the functionality from a previous PR that handled Expert Parallelism (EP) MoE. The changes introduce a new `FlashInferFusedMoE` layer and adapt the FP8 quantization logic to support this new backend.
My review has identified a few issues:
- Critical Correctness Issue: A new `apply` method in `Fp8MoEMethod` overwrites an existing one, which will break the standard MoE path. This needs to be resolved by merging the two methods.
- Performance: An inefficient weight-swapping operation is performed on every forward pass. This should be moved to the weight-loading stage to avoid runtime overhead.
- Maintainability: I've flagged several debug `print` statements that should be removed before merging, along with a hardcoded magic number and local imports that reduce code clarity.
The overall direction is good, but the identified issues, especially the critical one, must be addressed to ensure correctness and performance.
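The review's two main points, merging the duplicate `apply` methods and hoisting the weight swap out of the forward path, could be addressed along these lines. This is a hypothetical sketch: `Fp8MoEMethodSketch`, `use_flashinfer_trtllm_moe`, `_apply_flashinfer`, `_apply_standard`, and the `[experts, 2*intermediate, hidden]` weight layout are illustrative names and assumptions, not the PR's actual code.

```python
import numpy as np

class Fp8MoEMethodSketch:
    """Illustrative merge of the two `apply` paths (not the PR's real class)."""

    def __init__(self, use_flashinfer_trtllm_moe: bool):
        self.use_flashinfer_trtllm_moe = use_flashinfer_trtllm_moe

    def process_weights_after_loading(self, layer: dict) -> None:
        # Reorder the gate/up halves of w13 ONCE at load time, rather than
        # on every forward pass, avoiding the per-step overhead the review
        # flagged. Assumes a [experts, 2*intermediate, hidden] layout.
        if self.use_flashinfer_trtllm_moe:
            w13 = layer["w13_weight"]
            half = w13.shape[-2] // 2
            layer["w13_weight"] = np.concatenate(
                [w13[..., half:, :], w13[..., :half, :]], axis=-2
            )

    def apply(self, layer: dict, x):
        # Single entry point dispatching on the backend flag, so the
        # standard MoE path is no longer shadowed by a second `apply`.
        if self.use_flashinfer_trtllm_moe:
            return self._apply_flashinfer(layer, x)
        return self._apply_standard(layer, x)

    def _apply_flashinfer(self, layer, x):
        return "flashinfer path"  # placeholder for the trtllm kernel call

    def _apply_standard(self, layer, x):
        return "standard path"  # placeholder for the existing fused path
```

Doing the reorder in a load-time hook also means the forward path sees weights in the layout the kernel expects, with no conditional copies in the hot loop.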
Force-pushed a83eb55 to ad0c778
Accuracy test results:
To repro:
E2E perf shows ~10% improvement for low-latency workloads:
To repro:
some nits
Force-pushed 2e79d5c to 42a8b20
follow-up tiny nit
Hi @kaixih, could you rebase? Thanks!
Force-pushed 6e7da6c to a432df0
Thanks for the heads-up. Rebased and conflicts resolved. @zhyncs
Note: this PR is also necessary for the previously merged PR (EP MoE + TRT-LLM kernel), since recent refactoring changes have broken it; the fix is included in this PR.
…-project#8450) Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>
A follow-up PR to enable the Flashinfer MoE blockscale fp8 backend for TP MoE.
The previous PR did the same for EP MoE.
cc @kushanam