
Conversation

@fzyzcjy (Collaborator) commented Jun 10, 2025

Motivation

GSM8K and MMLU results are OK after integrating this PR into the dev branch.

Please have a brief look at this first; I will then split it into a kernel PR and an srt PR and remove the temporary code.

before: [benchmark screenshot]

after: [benchmark screenshot]

Modifications

Checklist

@gemini-code-assist bot (Contributor) left a comment


Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini or gemini-code-assist here, providing a summary of this pull request. This PR aims to improve performance by removing a potentially slow concatenation step before calling the cutlass_mla_decode kernel. Instead of concatenating the 'nope' (non-rotary) and 'pe' (positional encoding) parts of the query tensor (q) beforehand, the kernel and its surrounding code have been modified to accept q_nope and q_pe as separate input tensors. This change is propagated through the Python layer, C++ bindings, the CUDA kernel implementation, and the test suite. The DeepSeek v2 model implementation is also updated to utilize this new kernel signature.
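To make the call-pattern change concrete, here is a minimal illustrative sketch in Python (not the actual sglang code; the helper names and argument handling are assumptions based on the changelog below, with the kernel passed in as a callable):

import torch

def mla_decode_old(q_nope, q_pe, cutlass_mla_decode, *cache_args):
    # Old path: build one contiguous query tensor before the kernel call,
    # paying roughly 200 us for the extra copy.
    q_nope_and_q_pe = torch.cat([q_nope, q_pe], dim=-1)
    return cutlass_mla_decode(q_nope_and_q_pe, *cache_args)

def mla_decode_new(q_nope, q_pe, cutlass_mla_decode, *cache_args):
    # New path: the kernel consumes the two components through their own
    # pointers and strides, so no concatenation is needed.
    return cutlass_mla_decode(q_nope, q_pe, *cache_args)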

Highlights

  • Performance Optimization: The primary goal is to eliminate a slow 200us concatenation operation by allowing the attention kernel to process the 'nope' and 'pe' components of the query tensor separately.
  • Kernel Signature Change: The cutlass_mla_decode kernel and its Python wrapper now accept q_nope and q_pe as two distinct tensors instead of a single concatenated q_nope_and_q_pe tensor.
  • Code Structure Update: Changes are made across multiple layers, including the Python attention backend, C++ kernel implementation, C++ bindings, and the Python test suite, to accommodate the new kernel signature.
  • Model Integration: The DeepSeek v2 model's attention core is updated to pass the split q_nope and q_pe tensors to the attention backend when using the cutlass_mla backend.

Changelog

  • python/sglang/srt/layers/attention/cutlass_mla_backend.py
    • Modified forward_decode to accept optional q_rope and k_rope parameters (lines 237-238).
    • Added logic to call set_mla_kv_buffer when k_rope is provided (lines 245-251).
    • Reshaped the input q tensor and split it into q_nope and q_rope based on whether q_rope was provided (lines 261-269).
    • Updated the call to cutlass_mla_decode to pass q_nope and q_pe separately (lines 277-278).
  • python/sglang/srt/models/deepseek_v2.py
    • Added cutlass_mla to the list of attention backends that call self.attn_mqa with q_rope and k_rope in forward_absorb_core (lines 1013-1016).
  • sgl-kernel/csrc/attention/cutlass_mla_kernel.cu
    • Commented out the CUDA version check block (lines 31-46, 268-269).
    • Modified the args_from_options function to accept separate q_nope and q_pe tensors (lines 97-98).
    • Updated the calculation of strides for q_nope and q_pe based on their individual strides (lines 129-132).
    • Updated the args_from_options function to use separate pointers and strides for q_nope and q_pe (lines 143-152).
    • Modified the runMla function signature to accept separate q_nope and q_pe tensors (lines 181-182).
    • Updated the call to args_from_options within runMla to pass separate q_nope and q_pe (line 191).
    • Modified the cutlass_mla_decode function signature to accept separate q_nope and q_pe tensors (lines 213-214).
    • Updated device guarding and stream retrieval to use the q_nope tensor (lines 220-222).
    • Updated calls to runMla within cutlass_mla_decode to pass separate q_nope and q_pe (lines 232, 235, 238).
  • sgl-kernel/csrc/common_extension.cc
    • Updated the PyTorch C++ extension definition for cutlass_mla_decode to reflect the new signature with separate q_nope and q_pe tensors (line 62).
  • sgl-kernel/include/sgl_kernel_ops.h
    • Updated the C++ header declaration for cutlass_mla_decode to match the new signature with separate q_nope and q_pe tensors (lines 108-109).
  • sgl-kernel/python/sgl_kernel/attention.py
    • Modified the cutlass_mla_decode function signature to accept separate q_nope and q_pe tensors (lines 55-56).
    • Updated assertions to check dimensions and shapes of the separate q_nope and q_pe tensors (lines 63-79).
    • Modified padding logic to pad q_nope and q_pe separately if the number of heads is less than MAX_HEADS (lines 84-90); see the sketch after this changelog.
    • Updated dtype assertions to check q_nope and ensure all relevant inputs have the same dtype (lines 99-103).
    • Updated the output tensor creation to use q_nope as a reference (line 111).
    • Updated the call to the C++ extension torch.ops.sgl_kernel.cutlass_mla_decode.default to pass q_nope and q_pe separately (lines 115-116).
  • sgl-kernel/tests/test_cutlass_mla.py
    • Added logic to split the input q tensor into q_nope and q_pe before calling cutlass_mla_decode (lines 89-91).
    • Updated the call to cutlass_mla_decode to pass the newly created q_nope and q_pe tensors (line 96).
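As a rough illustration of the padding change mentioned in the sgl_kernel/attention.py entry above, the sketch below pads q_nope and q_pe independently along the head dimension. This is a hedged sketch, not the code in sgl-kernel/python/sgl_kernel/attention.py; the MAX_HEADS value and the [batch, heads, dim] layout are assumptions.

import torch

MAX_HEADS = 128  # assumed padding target; the real constant lives in sgl_kernel/attention.py

def pad_query_components(q_nope: torch.Tensor, q_pe: torch.Tensor):
    # Pad q_nope and q_pe separately along the head dimension; previously a
    # single concatenated tensor would have been padded once.
    bs, heads, _ = q_nope.shape
    assert q_pe.shape[:2] == (bs, heads), "q_nope and q_pe must share batch/head dims"
    if heads < MAX_HEADS:
        q_nope_padded = q_nope.new_zeros((bs, MAX_HEADS, q_nope.shape[-1]))
        q_nope_padded[:, :heads] = q_nope
        q_pe_padded = q_pe.new_zeros((bs, MAX_HEADS, q_pe.shape[-1]))
        q_pe_padded[:, :heads] = q_pe
        return q_nope_padded, q_pe_padded
    return q_nope, q_pe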

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request aims to improve performance by adapting the cutlass_mla_decode kernel and its callers to use split q_nope and q_pe tensors, thereby avoiding a potentially slow concatenation operation. The changes are consistently applied across the Python and CUDA C++ codebase.

However, there are a couple of important points to address, particularly regarding the CUDA version check and a potentially temporary include. The PR description also indicates this is a work-in-progress, so these comments are to help refine it before finalization.

Summary of Findings

  • CUDA Version Check Disabled: The preprocessor directive checking for CUDA version >= 12.4 in cutlass_mla_kernel.cu has been commented out. This is a critical issue as it might lead to build or runtime failures on systems not meeting this requirement if the kernel depends on features from CUDA 12.4+.
  • Inclusion of <iostream>: The <iostream> header was added to cutlass_mla_kernel.cu. This is often used for debugging and should typically be removed from production kernel code.
  • Variable Naming Consistency (q_rope vs q_pe): In python/sglang/srt/layers/attention/cutlass_mla_backend.py, the variable q_rope is used and passed as q_pe to the kernel. While functionally correct if they represent the same data, consistently using q_pe on the Python side where it maps to the kernel's q_pe might improve clarity. (Severity: low, not commented due to review settings)

Merge Readiness

This pull request makes significant progress in refactoring the attention mechanism to avoid a concatenation step, which is a positive step for performance. However, due to the critical issue of the disabled CUDA version check and the PR's work-in-progress nature (as indicated by the description and checklist), it is not ready for merging.

I recommend addressing the critical and medium severity comments. Specifically, the CUDA version check needs to be clarified and likely reinstated. Once these issues are resolved and the planned splitting into separate PRs and removal of temporary code are complete, it can be re-evaluated. I am unable to approve pull requests, so further review and approval by other maintainers will be necessary after the changes are made.

Comment on lines 31 to 46
// #if !defined(CUDA_VERSION) || CUDA_VERSION < 12040
// #error "dont be here"
// void cutlass_mla_decode(
// torch::Tensor const& out,
// torch::Tensor const& q_nope,
// torch::Tensor const& q_pe,
// torch::Tensor const& kv_c_and_k_pe_cache,
// torch::Tensor const& seq_lens,
// torch::Tensor const& page_table,
// torch::Tensor const& workspace) {
// TORCH_CHECK(false, "CUDA version must be >= 12.4 for cutlass_mla_decode");
// }
// int64_t cutlass_mla_get_workspace_size(int64_t max_seq_len, int64_t num_batches, int64_t sm_count, int64_t num_kv_splits) {
// TORCH_CHECK(false, "CUDA version must be >= 12.4 for cutlass_mla_get_workspace_size");
// }
// #else
@gemini-code-assist bot (Contributor) commented:

critical

The CUDA version check (#if !defined(CUDA_VERSION) || CUDA_VERSION < 12040) and the corresponding error/fallback functions have been commented out. If the cutlass_mla_decode kernel indeed has a minimum CUDA version requirement (e.g., 12.4 as indicated), removing this check could lead to compilation errors or runtime failures on systems with older CUDA versions.

Could you clarify the reason for commenting this out? Is this a temporary change for development, or is the kernel now compatible with older CUDA versions? If it's temporary, it's crucial to restore this check or an equivalent mechanism before merging to ensure compatibility and prevent unexpected issues for users.

@@ -22,25 +22,28 @@ limitations under the License.
#include <torch/all.h>

#include <cute/tensor.hpp>
#include <iostream>
@gemini-code-assist bot (Contributor) commented:

medium

The <iostream> header has been included. Is this for debugging purposes (e.g., std::cout)? If so, it's generally recommended to remove such debugging headers and statements from production code to keep the build clean and avoid potential side effects or performance impacts.

@fzyzcjy changed the title from "Remove 200us slow concat kernel" to "Remove 200us slow concat kernel (part 2: srt)" on Jun 13, 2025
@zhyncs merged commit c49c1d9 into sgl-project:main on Jun 13, 2025
30 of 50 checks passed