Remove 200us slow concat kernel (part 2: srt) #7020
Conversation
Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini or gemini-code-assist here, providing a summary of this pull request. This PR aims to improve performance by removing a potentially slow concatenation step before calling the cutlass_mla_decode kernel. Instead of concatenating the 'nope' (non-rotary) and 'pe' (positional encoding) parts of the query tensor (q) beforehand, the kernel and its surrounding code have been modified to accept q_nope and q_pe as separate input tensors. This change is propagated through the Python layer, C++ bindings, the CUDA kernel implementation, and the test suite. The DeepSeek v2 model implementation is also updated to utilize this new kernel signature.
Highlights
- Performance Optimization: The primary goal is to eliminate a slow 200us concatenation operation by allowing the attention kernel to process the 'nope' and 'pe' components of the query tensor separately.
- Kernel Signature Change: The cutlass_mla_decode kernel and its Python wrapper now accept q_nope and q_pe as two distinct tensors instead of a single concatenated q_nope_and_q_pe tensor (see the sketch after this list).
- Code Structure Update: Changes are made across multiple layers, including the Python attention backend, C++ kernel implementation, C++ bindings, and the Python test suite, to accommodate the new kernel signature.
- Model Integration: The DeepSeek v2 model's attention core is updated to pass the split q_nope and q_pe tensors to the attention backend when using the cutlass_mla backend.
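In tensor terms, the change the highlights describe looks roughly like the following sketch. This is not the actual sglang code: it is a minimal PyTorch illustration, and the head count and head-dimension sizes (512 for the latent 'nope' part, 64 for the rotary 'pe' part) are assumptions for a DeepSeek-style MLA decode step.

```python
import torch

# Assumed shapes for illustration only.
batch, heads, nope_dim, rope_dim = 4, 128, 512, 64
q_nope = torch.randn(batch, heads, nope_dim)
q_pe = torch.randn(batch, heads, rope_dim)

# Before this PR: the two query parts were fused into one tensor ahead of the
# kernel call, paying for an extra (~200us) copy on every decode step.
q_fused = torch.cat([q_nope, q_pe], dim=-1)

# After this PR: the kernel accepts q_nope and q_pe as two arguments, so that
# copy disappears. A caller holding only the fused tensor can still recover
# the two parts as views, without copying:
q_nope_view, q_pe_view = q_fused.split([nope_dim, rope_dim], dim=-1)
assert q_nope_view.data_ptr() == q_fused.data_ptr()  # shares storage with q_fused
```

The changelog entries below for cutlass_mla_kernel.cu note that strides are now computed from each input tensor individually, which is presumably what allows the kernel to consume such non-contiguous views rather than requiring a fused, contiguous layout.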
Changelog
- python/sglang/srt/layers/attention/cutlass_mla_backend.py
  - Modified forward_decode to accept optional q_rope and k_rope parameters (lines 237-238).
  - Added logic to call set_mla_kv_buffer when k_rope is provided (lines 245-251).
  - Reshaped the input q tensor and split it into q_nope and q_rope based on whether q_rope was provided (lines 261-269).
  - Updated the call to cutlass_mla_decode to pass q_nope and q_pe separately (lines 277-278).
- python/sglang/srt/models/deepseek_v2.py
  - Added cutlass_mla to the list of attention backends that call self.attn_mqa with q_rope and k_rope in forward_absorb_core (lines 1013-1016).
- sgl-kernel/csrc/attention/cutlass_mla_kernel.cu
  - Commented out the CUDA version check block (lines 31-46, 268-269).
  - Modified the args_from_options function to accept separate q_nope and q_pe tensors (lines 97-98).
  - Updated the calculation of strides for q_nope and q_pe based on their individual strides (lines 129-132).
  - Updated the args_from_options function to use separate pointers and strides for q_nope and q_pe (lines 143-152).
  - Modified the runMla function signature to accept separate q_nope and q_pe tensors (lines 181-182).
  - Updated the call to args_from_options within runMla to pass separate q_nope and q_pe (line 191).
  - Modified the cutlass_mla_decode function signature to accept separate q_nope and q_pe tensors (lines 213-214).
  - Updated device guarding and stream retrieval to use the q_nope tensor (lines 220-222).
  - Updated calls to runMla within cutlass_mla_decode to pass separate q_nope and q_pe (lines 232, 235, 238).
- sgl-kernel/csrc/common_extension.cc
  - Updated the PyTorch C++ extension definition for cutlass_mla_decode to reflect the new signature with separate q_nope and q_pe tensors (line 62).
- sgl-kernel/include/sgl_kernel_ops.h
  - Updated the C++ header declaration for cutlass_mla_decode to match the new signature with separate q_nope and q_pe tensors (lines 108-109).
- sgl-kernel/python/sgl_kernel/attention.py
  - Modified the cutlass_mla_decode function signature to accept separate q_nope and q_pe tensors (lines 55-56).
  - Updated assertions to check dimensions and shapes of the separate q_nope and q_pe tensors (lines 63-79).
  - Modified padding logic to pad q_nope and q_pe separately if the number of heads is less than MAX_HEADS (lines 84-90); a hedged sketch of this step follows the changelog.
  - Updated dtype assertions to check q_nope and ensure all relevant inputs have the same dtype (lines 99-103).
  - Updated the output tensor creation to use q_nope as a reference (line 111).
  - Updated the call to the C++ extension torch.ops.sgl_kernel.cutlass_mla_decode.default to pass q_nope and q_pe separately (lines 115-116).
- sgl-kernel/tests/test_cutlass_mla.py
  - Added logic to split the input q tensor into q_nope and q_pe before calling cutlass_mla_decode (lines 89-91).
  - Updated the call to cutlass_mla_decode to pass the newly created q_nope and q_pe tensors (line 96).
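As a concrete illustration of the padding entry for sgl-kernel/python/sgl_kernel/attention.py above, here is a hedged PyTorch sketch of padding the two query parts separately up to the kernel's fixed head count. It is not the actual attention.py code; the MAX_HEADS value of 128 and the (batch, num_heads, head_dim) layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

MAX_HEADS = 128  # assumed value; the real constant lives in sgl_kernel/attention.py


def pad_query_heads(q_nope: torch.Tensor, q_pe: torch.Tensor):
    """Pad both query parts along the head dimension up to MAX_HEADS.

    Assumes each tensor is laid out as (batch, num_heads, head_dim).
    """
    num_heads = q_nope.shape[1]
    if num_heads < MAX_HEADS:
        extra = MAX_HEADS - num_heads
        # F.pad consumes pairs from the last dimension backwards:
        # (head_dim_lo, head_dim_hi, num_heads_lo, num_heads_hi).
        # Only the high side of the head count is padded.
        q_nope = F.pad(q_nope, (0, 0, 0, extra))
        q_pe = F.pad(q_pe, (0, 0, 0, extra))
    return q_nope, q_pe


# Example: 16 real heads padded up to 128 for both parts.
q_nope, q_pe = pad_query_heads(torch.randn(2, 16, 512), torch.randn(2, 16, 64))
assert q_nope.shape == (2, 128, 512) and q_pe.shape == (2, 128, 64)
```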
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
- Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request aims to improve performance by adapting the cutlass_mla_decode kernel and its callers to use split q_nope and q_pe tensors, thereby avoiding a potentially slow concatenation operation. The changes are consistently applied across the Python and CUDA C++ codebase.
However, there are a couple of important points to address, particularly regarding the CUDA version check and a potentially temporary include. The PR description also indicates this is a work-in-progress, so these comments are to help refine it before finalization.
Summary of Findings
- CUDA Version Check Disabled: The preprocessor directive checking for CUDA version >= 12.4 in cutlass_mla_kernel.cu has been commented out. This is a critical issue, as it might lead to build or runtime failures on systems not meeting this requirement if the kernel depends on features from CUDA 12.4+.
- Inclusion of <iostream>: The <iostream> header was added to cutlass_mla_kernel.cu. This is often used for debugging and should typically be removed from production kernel code.
- Variable Naming Consistency (q_rope vs q_pe): In python/sglang/srt/layers/attention/cutlass_mla_backend.py, the variable q_rope is used and passed as q_pe to the kernel. While functionally correct if they represent the same data, consistently using q_pe on the Python side where it maps to the kernel's q_pe might improve clarity. (Severity: low, not commented due to review settings)
Merge Readiness
This pull request makes significant progress in refactoring the attention mechanism to avoid a concatenation step, which is a positive step for performance. However, due to the critical issue of the disabled CUDA version check and the PR's work-in-progress nature (as indicated by the description and checklist), it is not ready for merging.
I recommend addressing the critical- and medium-severity comments. Specifically, the CUDA version check needs to be clarified and likely reinstated. Once these issues are resolved and the planned splitting into separate PRs and removal of temporary code are complete, it can be re-evaluated. I am unable to approve pull requests, so further review and approval by other maintainers will be necessary after the changes are made.
// #if !defined(CUDA_VERSION) || CUDA_VERSION < 12040
// #error "dont be here"
// void cutlass_mla_decode(
//     torch::Tensor const& out,
//     torch::Tensor const& q_nope,
//     torch::Tensor const& q_pe,
//     torch::Tensor const& kv_c_and_k_pe_cache,
//     torch::Tensor const& seq_lens,
//     torch::Tensor const& page_table,
//     torch::Tensor const& workspace) {
//   TORCH_CHECK(false, "CUDA version must be >= 12.4 for cutlass_mla_decode");
// }
// int64_t cutlass_mla_get_workspace_size(int64_t max_seq_len, int64_t num_batches, int64_t sm_count, int64_t num_kv_splits) {
//   TORCH_CHECK(false, "CUDA version must be >= 12.4 for cutlass_mla_get_workspace_size");
// }
// #else
The CUDA version check (#if !defined(CUDA_VERSION) || CUDA_VERSION < 12040) and the corresponding error/fallback functions have been commented out. If the cutlass_mla_decode kernel indeed has a minimum CUDA version requirement (e.g., 12.4 as indicated), removing this check could lead to compilation errors or runtime failures on systems with older CUDA versions.
Could you clarify the reason for commenting this out? Is this a temporary change for development, or is the kernel now compatible with older CUDA versions? If it's temporary, it's crucial to restore this check or an equivalent mechanism before merging to ensure compatibility and prevent unexpected issues for users.
@@ -22,25 +22,28 @@ limitations under the License.
#include <torch/all.h>

#include <cute/tensor.hpp>
#include <iostream>
Motivation
GSM8K and MMLU results are OK after integrating this PR into the dev branch.
Please have a brief look at this first; I will then split it into a kernel PR and an srt PR and remove the temporary code.
before
after
Modifications
Checklist