[feat]: add SM100 support for cutlass FP8 groupGEMM #20447
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs initially. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @djmmoss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive support for NVIDIA's SM100 (Blackwell) architecture for Grouped GEMM (General Matrix Multiply) operations within the CUTLASS library. This enhancement specifically targets FP8 quantized Mixture-of-Experts (MoE) models, expanding the hardware compatibility for efficient MoE inference and allowing vLLM to leverage the latest NVIDIA GPU capabilities.
Highlights
- SM100 GroupGEMM Support: Added a new CUDA kernel (`grouped_mm_c3x_sm100.cu`) and associated build configurations to enable GroupGEMM operations on NVIDIA's SM100 (Blackwell) architecture, specifically for FP8 quantized Mixture-of-Experts (MoE) models.
- Generalized CUTLASS Kernel: Refactored the core CUTLASS GroupGEMM template (`grouped_mm_c3x.cuh`) to be architecture-agnostic, allowing it to be used for both SM90 (Hopper) and SM100 (Blackwell) by passing the `ArchTag` as a template parameter.
- Dynamic Kernel Dispatch & Build System Updates: Updated the C++ and Python logic to dynamically select the appropriate GroupGEMM kernel (SM90 or SM100) based on the detected GPU architecture and CUDA compiler version. `CMakeLists.txt` was modified to conditionally compile the SM100 kernel when CUDA 12.8 or newer is available and targeting SM100 architectures.
Code Review
This pull request adds support for SM100 (Blackwell) architecture for CUTLASS grouped GEMM operations, which is essential for running FP8 quantized Mixture-of-Experts (MoE) models. The changes are comprehensive, touching CMake build files, CUDA C++ kernels, and Python-level logic.
My review has identified a couple of critical issues that need to be addressed. There is an incorrect runtime dispatch logic in one of the C++ files that could lead to calling the wrong kernel, and a logical error in the Python code for detecting supported quantization schemes due to operator precedence. I've provided specific suggestions to fix these issues. Addressing these points will ensure the new functionality is robust and correct.
#if defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100
  cutlass_moe_mm_sm100(out_tensors, a_tensors, b_tensors, a_scales, b_scales,
                       expert_offsets, problem_sizes, a_strides, b_strides,
                       c_strides, per_act_token, per_out_ch);
  return;
#endif
The dispatch logic here is incorrect. It relies only on compile-time preprocessor directives (`#if defined ...`) and doesn't use the runtime SM version (`version_num`) for dispatching. This will lead to runtime errors if the code is compiled for multiple architectures (e.g., both SM90 and SM100).

For instance, if both `ENABLE_CUTLASS_MOE_SM100` and `ENABLE_CUTLASS_MOE_SM90` are defined, this code will always try to call `cutlass_moe_mm_sm100`, even when running on an SM90 GPU, which will fail.

The dispatch logic should be based on the `version_num` queried at runtime, similar to how `cutlass_scaled_mm` is implemented in this file. Please update this block and the subsequent one for SM90 to use `version_num` for correct runtime dispatch.
#if defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100
if (version_num >= 100) {
cutlass_moe_mm_sm100(out_tensors, a_tensors, b_tensors, a_scales, b_scales,
expert_offsets, problem_sizes, a_strides, b_strides,
c_strides, per_act_token, per_out_ch);
return;
}
#endif
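The intent of the fix above can be mirrored in a small standalone sketch. The following Python snippet is illustrative only (the real code is C++ in this file): the `*_enabled` flags stand in for the `ENABLE_CUTLASS_MOE_SM*` compile-time macros, and `version_num` for the runtime compute-capability query.

```python
# Hypothetical sketch, not vLLM's actual code: combining compile-time
# availability flags with a runtime SM-version check for dispatch.

def dispatch_moe_mm(version_num: int,
                    sm100_enabled: bool = True,
                    sm90_enabled: bool = True) -> str:
    """Pick a grouped-GEMM kernel based on the GPU's SM version.

    `version_num` mimics the runtime compute-capability query (90 for
    Hopper, 100 for Blackwell); the *_enabled flags stand in for the
    ENABLE_CUTLASS_MOE_SM100 / ENABLE_CUTLASS_MOE_SM90 macros.
    """
    if sm100_enabled and version_num >= 100:
        return "cutlass_moe_mm_sm100"
    if sm90_enabled and 90 <= version_num < 100:
        return "cutlass_moe_mm_sm90"
    raise RuntimeError(f"No compiled grouped-GEMM kernel for SM{version_num}")

# With both kernels compiled in, an SM90 GPU still gets the SM90 kernel,
# which is exactly what the compile-time-only guard failed to guarantee.
print(dispatch_moe_mm(90))   # cutlass_moe_mm_sm90
print(dispatch_moe_mm(100))  # cutlass_moe_mm_sm100
```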
@djmmoss I think Gemini is actually correct here; we should also add runtime switches
fair, I've added the check in
return (
    self._check_scheme_supported(90, error=False, match_exact=True)
    or self._check_scheme_supported(100, error=False, match_exact=True)
    and self._is_fp8_w8a8(weight_quant, input_quant))
There is a logical error in this condition due to Python's operator precedence: `and` is evaluated before `or`.

The current logic is equivalent to:

self._check_scheme_supported(...) or (self._check_scheme_supported(...) and self._is_fp8_w8a8(...))

This means that if SM90 is supported, the function will return `True` even if the quantization scheme is not `w8a8`, which is incorrect.

You should add parentheses to group the `or` condition, so the architecture check is performed before checking the quantization scheme.
Suggested change:

return (
    (self._check_scheme_supported(90, error=False, match_exact=True) or
     self._check_scheme_supported(100, error=False, match_exact=True)) and
    self._is_fp8_w8a8(weight_quant, input_quant))
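For readers unfamiliar with the pitfall, here is a minimal standalone demonstration of why the parentheses matter (`and` binds tighter than `or` in Python):

```python
# Stand-in booleans for the three checks in the review comment:
# SM90 supported, SM100 not supported, scheme is NOT fp8 w8a8.
sm90, sm100, is_w8a8 = True, False, False

# Buggy form: parsed as  sm90 or (sm100 and is_w8a8)
buggy = sm90 or sm100 and is_w8a8

# Fixed form: architecture check grouped, then the scheme check must pass.
fixed = (sm90 or sm100) and is_w8a8

print(buggy)  # True  -- wrongly reports support even though scheme is not w8a8
print(fixed)  # False -- correct: the w8a8 check must also hold
```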
Thanks for the work! Could you also add some benchmark comparison with Triton?
 elif quant_config._is_fp4a4_nvfp4(weight_quant, input_quant):
     return CompressedTensorsW4A4MoeMethod()
-elif quant_config._is_fp8_w8a8_sm90(weight_quant, input_quant):
+elif quant_config._is_fp8_w8a8_sm90_or_sm100(weight_quant,
+                                             input_quant):
Could this be divided into two branches, `_is_fp8_w8a8_sm90` and `_is_fp8_w8a8_sm100`?
In this case we don't need complicated logic in `_is_fp8_w8a8_sm90_or_sm100`, and it may be better for future architectures.
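A hedged sketch of what the suggested split might look like; the class, attribute, and method names below are illustrative stand-ins, not vLLM's actual implementation:

```python
# Illustrative sketch only: one helper per architecture instead of a
# combined _is_fp8_w8a8_sm90_or_sm100, keeping each branch trivial and
# making a future _is_fp8_w8a8_sm110 a one-line addition.

class QuantConfigSketch:
    def __init__(self, supported_sm: set[int], fp8_w8a8: bool):
        self.supported_sm = supported_sm  # SM versions this build supports
        self.fp8_w8a8 = fp8_w8a8          # whether the scheme is fp8 w8a8

    def _is_fp8_w8a8_sm90(self) -> bool:
        return 90 in self.supported_sm and self.fp8_w8a8

    def _is_fp8_w8a8_sm100(self) -> bool:
        return 100 in self.supported_sm and self.fp8_w8a8

cfg = QuantConfigSketch(supported_sm={100}, fp8_w8a8=True)

# Dispatch then reads as two explicit, easily extended branches:
if cfg._is_fp8_w8a8_sm90():
    method = "sm90_moe_method"
elif cfg._is_fp8_w8a8_sm100():
    method = "sm100_moe_method"
else:
    method = "fallback"
print(method)  # sm100_moe_method
```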
fixed 👍
Since there are no tuned configs in this PR, I expect this will give poor performance. @djmmoss do you plan on including those in this PR or a followup? This seems reasonable to me at a kernel level.
@mgoin we are planning to add the tuned configs to this PR, please hold off on merging until they are in place 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are planning to add the tuned configs to this PR, please hold off on merging until they are in place
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from cf2d006 to dac867e
Kernel perf benchmark
@mgoin the PR is ready to go. The cutlass perf has been tuned. We implemented fallback logic to the Triton MoE path at small batch sizes because we noticed that the extra kernel overhead of using cutlass outweighs the grouped GEMM perf gain, leading to worse e2e perf in low-latency cases.
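The small-batch fallback idea can be sketched as a simple backend selector; the threshold value below is illustrative only, not the one tuned in this PR, and the function name is a hypothetical stand-in:

```python
# Illustrative sketch of a batch-size-gated backend choice: below a small
# threshold, the fixed per-call kernel overhead of the CUTLASS grouped-GEMM
# path dominates, so the Triton MoE path gives better end-to-end latency.

SMALL_BATCH_THRESHOLD = 8  # assumption for illustration; the real cutoff is tuned

def select_moe_backend(num_tokens: int) -> str:
    if num_tokens < SMALL_BATCH_THRESHOLD:
        return "triton_moe"          # low-latency small-batch path
    return "cutlass_grouped_gemm"    # throughput path for larger batches

print(select_moe_backend(2))    # triton_moe
print(select_moe_backend(128))  # cutlass_grouped_gemm
```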
@jiahanc I made a swap_ab change last week (vllm/csrc/quantization/cutlass_w8a8/moe/moe_data.cu, lines 10 to 46 in f37db45). The current implementation looks exactly the same, especially on handling the problem_sizes in moe_data.cu. Do you think we can merge mine first and you rebase on top of it?
@shixianc, sure, we can hold off this PR till yours is merged.
Hold off till #20911 merges first.
Force-pushed from 0f0697b to 89eb69c
This pull request has merge conflicts that must be resolved before it can be merged.
Nice! Please resolve the merge conflict and I'll enable full CI.
Force-pushed from 89eb69c to 2de546c
@mgoin Finished the rebase, could you help review?
LGTM nice work, just one request for documentation
(Review comment on vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py, now resolved)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
LGTM, thank you for the support!
Head branch was pushed to by a user without write access
@mgoin The CI failure seems unrelated to this change. Could you take a look and force-merge if it looks good?
Signed-off-by: Duncan Moss <djm.moss@gmail.com> Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Co-authored-by: mgoin <mgoin64@gmail.com>
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Purpose
Add GroupGEMM support for SM100

Test Plan

Test Result

(Optional) Documentation Update