update sgl-kernel for EP: kernel part #8514
Conversation
Summary of Changes
Hello @ch-wan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is a work-in-progress (WIP) aimed at updating the Mixture of Experts (MoE) kernel, specifically the `moe_align_block_size` function and its underlying implementations. The core changes adjust array indexing and buffer sizing in the Python, C++, and CUDA components to resolve potential off-by-one errors or to align with a revised indexing scheme for expert IDs, which is critical for accurate token routing and padding in MoE operations.
Highlights

- **MoE Kernel Indexing Adjustments**: I've made consistent modifications across the Python, C++, and CUDA kernels to adjust array indices (e.g., `topk_ids[i] + 1`, `e - 1`, `left - 2`) when processing expert IDs and token counts within the `moe_align_block_size` logic. This likely addresses off-by-one errors or aligns with a new indexing scheme.
- **Buffer Size Correction**: The `cumsum_buffer` in the Python layer (`fused_moe.py`) has been expanded from `(num_experts + 1,)` to `(num_experts + 2,)`, indicating a need for an additional slot to correctly handle cumulative sums in the MoE alignment process; a simplified sketch of the resulting Python-side allocation follows this list.
- **Parameter Alignment for Native Kernels**: The `num_experts` parameter passed from the Python `moe_align_block_size` function to the underlying `sgl_moe_align_block_size` kernel is now `num_experts + 1`. This suggests a change in how the number of experts is interpreted or indexed by the native C++/CUDA implementations.
- **Debugging Aid**: A `print(num_experts)` statement was temporarily added in the Python `moe_align_block_size` function, likely for debugging purposes during the development of these changes.
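To make the Python-side highlights concrete, here is a minimal, hypothetical sketch of what the wrapper might look like after this change. It is inferred only from the points above; the real `moe_align_block_size` in `fused_moe.py` takes additional arguments (block size, output buffers, etc.), and the call into the native kernel is only indicated in a comment.

```python
import torch


def moe_align_block_size_sketch(topk_ids: torch.Tensor, num_experts: int):
    """Hypothetical sketch of the Python wrapper after this PR (not the real signature).

    Assumptions drawn from the highlights above:
      * filtered experts carry id -1, and the native kernels shift ids by +1
        internally (topk_ids[i] + 1), so ids land in [0, num_experts]
      * the cumulative-sum buffer therefore needs num_experts + 2 slots
      * the native kernel is told about num_experts + 1 buckets (the extra
        bucket collecting tokens whose expert was filtered out)
    """
    num_tokens_post_pad = torch.empty((1,), dtype=torch.int32, device=topk_ids.device)

    # One extra slot compared to the previous (num_experts + 1,) allocation.
    cumsum_buffer = torch.empty(
        (num_experts + 2,), dtype=torch.int32, device=topk_ids.device
    )

    # The real code would invoke the sgl-kernel op here, roughly:
    # sgl_moe_align_block_size(topk_ids, num_experts + 1, ..., cumsum_buffer, ...)
    return cumsum_buffer, num_tokens_post_pad
```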
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This pull request updates the Mixture-of-Experts (MoE) kernels for expert parallelism (EP). The changes consistently apply a `+1` shift to expert IDs across the Python, C++, and CUDA files, likely to switch to a 1-based indexing scheme internally. The logic appears consistent and correct within the scope of this change.

I've identified a leftover debug `print` statement in a Python file, which I've marked as `high` severity because it could cause issues in a production environment. Please ensure it is removed. Additionally, I've raised concerns about potential out-of-bounds access in the C++ and CUDA code due to the `-1` and `-2` adjustments to expert IDs, which need careful review.
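To make the out-of-bounds concern easier to follow, here is a toy illustration of the index ranges involved; the variable names are hypothetical and do not correspond to the actual CUDA/C++ code.

```python
# Toy illustration only: shows why a "- 1" style adjustment is safe for real
# experts but not for the filtered bucket after the +1 shift.
num_experts = 8
raw_topk_ids = [-1, 0, 3, 7]                 # -1 marks a token whose expert was filtered out in EP
shifted_ids = [e + 1 for e in raw_topk_ids]  # -> [0, 1, 4, 8], valid for a (num_experts + 2)-slot buffer

for e in shifted_ids:
    assert 0 <= e <= num_experts             # within bounds after the shift
    # An "e - 1" adjustment stays in range only when e >= 1, i.e. for real experts;
    # applied to the filtered bucket (shifted id 0), it would produce index -1.
```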
@@ -579,7 +579,7 @@ def moe_align_block_size(
    num_tokens_post_pad = torch.empty((1), dtype=torch.int32, device=topk_ids.device)

    cumsum_buffer = torch.empty(
Can we add a comment here?
LGTM.
…buffer (#8526)
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Motivation
In EP, we set the expert IDs of filtered experts to -1. We update sgl-kernel to handle this case.
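For context, here is a minimal sketch of where the -1 IDs come from; the masking below is illustrative only (the actual filtering lives in the EP routing logic, not in this PR), and `local_experts` is an assumed example.

```python
import torch

# Illustrative example: on each EP rank, experts owned by other ranks are
# removed from topk_ids by writing -1 in their place.
topk_ids = torch.tensor([[2, 5], [7, 1]])   # experts selected per token
local_experts = torch.tensor([0, 1, 2, 3])  # experts resident on this rank (assumed)

mask = torch.isin(topk_ids, local_experts)
filtered_topk_ids = torch.where(mask, topk_ids, torch.full_like(topk_ids, -1))
# tensor([[ 2, -1],
#         [-1,  1]])

# The updated sgl-kernel must tolerate these -1 entries, e.g. by shifting all
# IDs by +1 so filtered tokens land in a dedicated bucket instead of indexing
# out of bounds.
```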
Modifications
Accuracy Test
Benchmark & Profiling
Checklist