[1/2] sgl-kernel: Fuse routed scaling factor into select_experts #8364
Conversation
Summary of Changes
Hello @trevor-m, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a performance optimization for Mixture of Experts (MoE) layers, particularly for FP4 quantized models. The core change involves fusing the application of a `routed_scaling_factor` directly into the expert selection kernel, which should lead to more efficient computations. This work is a direct follow-up to a previous related pull request.
Highlights
- MoE Optimization: I've implemented a fusion optimization for Mixture of Experts (MoE) layers, specifically targeting FP4 quantized models. This change moves the application of the `routed_scaling_factor` directly into the expert selection process.
- Scaling Factor Fusion Logic: The `routed_scaling_factor` is now conditionally applied within the CUDA kernels responsible for `moe_fused_gate` (expert selection) during the expert weight normalization. This aims to improve computational efficiency by reducing separate operations.
- Conditional Activation: A new boolean parameter, `apply_routed_scaling_factor_on_output`, has been introduced to control this fusion (sketched below). It is specifically enabled for `FusedMoE` experts using `ModelOptNvFp4FusedMoEMethod` in models like DeepSeekV2.
- Shared Expert Fusion Compatibility: The previous restriction that disabled shared expert fusion when Flashinfer MoE was enabled has been removed, potentially allowing for more flexible and optimized configurations.
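For intuition, here is a minimal PyTorch sketch of the behavior the new flag controls. It is not the kernel implementation (the real gate's scoring, grouping, and bias handling differ), and the function name is illustrative; only `routed_scaling_factor` and `apply_routed_scaling_factor_on_output` come from this PR:

```python
import torch

def select_experts_sketch(router_logits: torch.Tensor,
                          top_k: int,
                          routed_scaling_factor: float,
                          apply_routed_scaling_factor_on_output: bool = False):
    """Simplified stand-in for the fused gating kernel, showing where the
    new flag takes effect. Not the actual sgl-kernel implementation."""
    scores = torch.softmax(router_logits.float(), dim=-1)
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    # Renormalize the selected expert weights.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    if apply_routed_scaling_factor_on_output:
        # Fused path: fold the routed scaling factor into the weights here,
        # so no separate multiply is needed on the MoE output later.
        topk_weights = topk_weights * routed_scaling_factor
    return topk_weights, topk_ids

# Example: 4 tokens, 8 experts, top-2 routing, scaling factor 2.5.
weights, ids = select_experts_sketch(torch.randn(4, 8), top_k=2,
                                     routed_scaling_factor=2.5,
                                     apply_routed_scaling_factor_on_output=True)
```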
Code Review
This pull request aims to fuse the `routed_scaling_factor` multiplication into the `select_experts` function, particularly for FP4 MoE, to improve performance. The changes involve adding a new boolean parameter, `apply_routed_scaling_factor_on_output`, which is propagated from the Python layers down to the CUDA kernels to control this fused operation.
Overall, the changes are logical and consistently implemented across the Python and C++/CUDA codebases. My review focuses on improving code robustness, maintainability, and readability. Key suggestions include:
- Replacing string-based type checking with `isinstance` for better maintainability.
- Using `NotImplementedError` instead of `assert` for unimplemented paths to prevent silent failures in production.
- Minor refactoring in both Python and CUDA code to improve clarity and reduce redundancy.

These improvements will make the code more robust and easier to maintain in the future; the first two suggestions are illustrated below.
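As a concrete illustration of those two points, here is a small, self-contained sketch; the class and function names are placeholders, not the actual sglang code:

```python
class ModelOptNvFp4FusedMoEMethod:
    """Stand-in for the real quantization-method class (illustration only)."""

def validate_fusion_request(quant_method, apply_routed_scaling_factor_on_output):
    # isinstance is preferred over comparing type(x).__name__ to a string:
    # it survives renames and also accepts subclasses.
    supported = isinstance(quant_method, ModelOptNvFp4FusedMoEMethod)
    if apply_routed_scaling_factor_on_output and not supported:
        # NotImplementedError is preferred over `assert`: assertions are
        # stripped when Python runs with -O, so an assert-based guard can
        # silently disappear in production.
        raise NotImplementedError(
            "apply_routed_scaling_factor_on_output is only supported on the "
            "NVFP4 fused-MoE path in this sketch"
        )

validate_fusion_request(ModelOptNvFp4FusedMoEMethod(), True)  # passes
```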
@merrymercy Could you please take a look? Thanks
Thanks for the PR! Can we also do sanity checks for TP cases (i.e., `--enable-ep-moe` not passed in) and with an FP8 model, to verify that there is no accuracy regression?
The SGL_CUTLASS_MOE=1 FP8 path was giving 0.832 accuracy because we weren't multiplying by `routed_scaling_factor`. I enabled the fusion to fix it, and now it gives 0.965.
@trevor-m Please fix the broken unit test for the fused gate.
Motivation
Follow-up to #8333.
Fuse the multiplication by `routed_scaling_factor` into `select_experts`, following the example of TRT-LLM:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_deepseekv3.py#L323
https://github.com/NVIDIA/TensorRT-LLM/blob/738ab615930fd08dccb94fa388bd74dc91c5f235/cpp/tensorrt_llm/kernels/noAuxTcKernels.cu#L651
For the non-FP4 paths, the `routed_scaling_factor` is currently fused into `moe_sum_reduce`. However, we could move it into `select_experts` for those paths too if we wanted to simplify the code.
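The fusion is numerically safe because the MoE combine step is linear in the routing weights: scaling the summed output by the factor is the same as scaling the selected top-k weights before the sum. A toy PyTorch check of that equivalence (not project code):

```python
import torch

torch.manual_seed(0)
s = 2.5                       # routed_scaling_factor
w = torch.rand(4, 2)          # top-k routing weights for 4 tokens, k = 2
y = torch.randn(4, 2, 8)      # per-expert outputs: (tokens, k, hidden)

# Multiply at the reduce (current non-FP4 behavior) ...
out_after = s * (w.unsqueeze(-1) * y).sum(dim=1)
# ... versus multiplying the gate weights (this PR's fused path).
out_fused = ((s * w).unsqueeze(-1) * y).sum(dim=1)

assert torch.allclose(out_after, out_fused)
```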
Modifications
Add a boolean argument, `apply_routed_scaling_factor_on_output`, to `fused_moe_gate()`.
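The contract implied by the new argument is that downstream code must skip its own multiply when the gate has already applied the factor. A hypothetical helper sketching that contract (names are illustrative, not the actual sglang API):

```python
import torch

def finalize_moe_output(moe_output: torch.Tensor,
                        routed_scaling_factor: float,
                        apply_routed_scaling_factor_on_output: bool) -> torch.Tensor:
    """Apply the routed scaling factor at most once."""
    if apply_routed_scaling_factor_on_output:
        # Already folded into the top-k weights by the fused gate.
        return moe_output
    return moe_output * routed_scaling_factor

out = finalize_moe_output(torch.randn(4, 8), 2.5,
                          apply_routed_scaling_factor_on_output=True)
```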
Results
Prefill:
- 10.46% speedup at BS 1
- 1.86% speedup at BS 128

Decode:
- 1.22% speedup at BS 1
- 0.26% speedup at BS 128
Server command
BEFORE results
AFTER results
Accuracy
Checklist