[Bugfix] Fused MoE Modular Kernel chunking loop #20392
Conversation
Summary of Changes
Hello @varun-sundar-rabindranath, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the chunking mechanism within the Fused Mixture of Experts (MoE) modular kernel. The changes aim to enhance the organization and maintainability of the code by extracting chunking responsibilities into dedicated functions. Additionally, it introduces a new, optimized kernel for counting tokens per expert, which is integrated into the chunking process to support more granular control and potential performance improvements.
Highlights
- Refactoring MoE Chunking Logic: The core chunking logic for the Fused MoE Modular Kernel has been refactored into dedicated private methods (`_do_fused_experts` and `_maybe_chunk_fused_experts`) within `modular_kernel.py`. This improves modularity and readability by separating the chunking orchestration from the direct application of fused experts.
- New Expert Token Counting Kernel: A new Triton-based kernel, `_count_expert_num_tokens`, and its Python wrapper `count_expert_num_tokens` have been introduced in `utils.py`. This kernel efficiently counts the number of tokens assigned to each expert and is now used during the chunked processing in the MoE kernel.
- Output Shape Adjustment in DeepGEMM MoE: The `deep_gemm_moe.py` module has been updated to change the expected output shape from `(M * topk, K)` to `(M, topk, K)` in its `workspace_shapes` method (line 86). Correspondingly, the `apply` method now uses `output.view((-1, K))` (line 159) to adapt to this new shape during `torch.index_select` operations.
- Comprehensive Testing for New Kernel: A new test file, `test_count_expert_num_tokens.py`, has been added to thoroughly validate the `count_expert_num_tokens` kernel across various configurations of tokens, top-k values, experts, and expert-parallelism sizes. It includes a reference CPU implementation for correctness verification; a hedged sketch of such a reference follows this list.
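For intuition, here is a minimal CPU-side sketch of what such a reference counter might look like. This is an illustrative assumption, not the vLLM implementation: the function name, the `-1` masking convention, and the use of `torch.bincount` are choices made for this sketch.

```python
import torch

def count_expert_num_tokens_ref(topk_ids: torch.Tensor,
                                num_local_experts: int) -> torch.Tensor:
    # Hypothetical reference: count tokens routed to each local expert.
    # Assumes assignments masked out (e.g. by an expert map) appear as -1.
    ids = topk_ids.flatten()
    ids = ids[ids >= 0]  # drop masked-out assignments
    return torch.bincount(ids, minlength=num_local_experts).to(torch.int32)

# 3 tokens routed with top-2 across 3 experts
topk_ids = torch.tensor([[0, 2], [1, 2], [0, 1]])
print(count_expert_num_tokens_ref(topk_ids, 3))  # tensor([2, 2, 2], dtype=torch.int32)
```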
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request refactors the chunking logic in the Fused MoE modular kernel and introduces a new utility function `count_expert_num_tokens` with a corresponding Triton kernel and tests. The refactoring improves code structure, but I've identified a potential performance regression related to workspace memory allocation within the new chunking loop. Additionally, I've raised concerns about the output shape in `deep_gemm_moe.py` and the reshaping of the `output` tensor before `torch.index_select`.
```python
c_expert_num_tokens = None
if expert_num_tokens is not None:
    c_expert_num_tokens = slice_expert_num_tokens(
        c_topk_ids, local_num_experts, expert_map)
```
`expert_num_tokens` holds the number of tokens assigned to each expert. This needs to be recalculated during chunking for correctness.
Could you add a summary of why you're making these changes to the PR description?
```python
tl.store(expert_num_tokens_ptr + curr_expert, tl.sum(acc))


def count_expert_num_tokens(
```
Could you add a docstring?
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has merge conflicts that must be resolved before it can be merged.
lgtm once merge conflicts are resolved
Purpose
Bug 1: `TritonOrDeepGemmExperts` chooses `DeepGemmExperts` for num_tokens > 128 and `TritonExperts` otherwise. Imagine a case where num_tokens is 130 and the chunk size is 128: we'd need `DeepGemmExperts::workspace_shapes` for the first chunk and `TritonExperts::workspace_shapes` for the second chunk.

Fix: Compute the workspace shapes individually for each chunk, as sketched below.
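A minimal sketch of what the per-chunk fix implies is shown below; the helper names (`_maybe_chunk`, `select_impl`) and the `workspace_shapes` argument list are assumptions for illustration, not vLLM's exact interfaces:

```python
# Illustrative sketch only: names and signatures are assumptions, not vLLM's API.
def _maybe_chunk(num_tokens: int, chunk_size: int, select_impl):
    for start in range(0, num_tokens, chunk_size):
        chunk_tokens = min(chunk_size, num_tokens - start)
        # Per-chunk kernel choice: 130 tokens with chunk_size=128 yields
        # chunks of 128 and 2 tokens, which may select different expert impls.
        impl = select_impl(chunk_tokens)
        # The fix: derive workspace shapes from this chunk's token count
        # rather than computing them once from the full batch.
        workspace_shapes = impl.workspace_shapes(chunk_tokens)
        yield start, chunk_tokens, impl, workspace_shapes
```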
Bug 2 (soft): `expert_num_tokens` contains the number of tokens assigned to each expert. During chunking, this isn't accounted for correctly.

Fix: The PR introduces a `count_expert_num_tokens` kernel and computes `expert_num_tokens` for each chunk individually; the illustration below shows why the batch-level counts can't simply be reused. This is termed a "soft bug" because, for experts that support chunking, `expert_num_tokens` is None / unused.
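To make the mismatch concrete, here is a small, hedged example using `torch.bincount` as a stand-in for the new kernel:

```python
import torch

# Flattened top-2 routing for 3 tokens across 3 experts.
topk_ids = torch.tensor([0, 2, 1, 2, 0, 1])

batch_counts = torch.bincount(topk_ids, minlength=3)      # tensor([2, 2, 2])
chunk_counts = torch.bincount(topk_ids[:4], minlength=3)  # tensor([1, 1, 2])
# Reusing batch_counts for the first chunk would over-count every expert,
# which is why the PR recomputes expert_num_tokens per chunk.
```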
As part of the fixes, this PR moves the chunking logic out of the main `FusedMoEModularKernel::forward` pass for clarity.

Test Plan
Machine: H100

pytest: `pytest -s tests/kernels/moe/test_modular_kernel_combinations.py` from #20449

Test Result
pytests pass both without chunking and with `VLLM_FUSED_MOE_CHUNK_SIZE=64`.