
Conversation

@fzyzcjy (Collaborator) commented Jun 10, 2025

Motivation

GSM8K and MMLU results are OK after integrating this PR into the dev branch.

Please have a brief look at this first; I will then split it into a kernel PR and an srt PR and remove the temporary code.

before: [benchmark screenshot]

after: [benchmark screenshot]

Modifications

Checklist

@gemini-code-assist bot (Contributor) left a comment


Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini or gemini-code-assist here, providing a summary of this pull request. This PR aims to improve performance by removing a potentially slow concatenation step before calling the cutlass_mla_decode kernel. Instead of concatenating the 'nope' (non-rotary) and 'pe' (positional encoding) parts of the query tensor (q) beforehand, the kernel and its surrounding code have been modified to accept q_nope and q_pe as separate input tensors. This change is propagated through the Python layer, C++ bindings, the CUDA kernel implementation, and the test suite. The DeepSeek v2 model implementation is also updated to utilize this new kernel signature.
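To make the call-pattern change concrete, here is a minimal illustrative sketch in Python (not the actual sglang code; the helper names and argument handling are assumptions based on the changelog below, with the kernel passed in as a callable):

import torch

def mla_decode_old(q_nope, q_pe, cutlass_mla_decode, *cache_args):
    # Old path: build one contiguous query tensor before the kernel call,
    # paying roughly 200 us for the extra copy.
    q_nope_and_q_pe = torch.cat([q_nope, q_pe], dim=-1)
    return cutlass_mla_decode(q_nope_and_q_pe, *cache_args)

def mla_decode_new(q_nope, q_pe, cutlass_mla_decode, *cache_args):
    # New path: the kernel consumes the two components through their own
    # pointers and strides, so no concatenation is needed.
    return cutlass_mla_decode(q_nope, q_pe, *cache_args)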

Highlights

  • Performance Optimization: The primary goal is to eliminate a slow 200us concatenation operation by allowing the attention kernel to process the 'nope' and 'pe' components of the query tensor separately.
  • Kernel Signature Change: The cutlass_mla_decode kernel and its Python wrapper now accept q_nope and q_pe as two distinct tensors instead of a single concatenated q_nope_and_q_pe tensor.
  • Code Structure Update: Changes are made across multiple layers, including the Python attention backend, C++ kernel implementation, C++ bindings, and the Python test suite, to accommodate the new kernel signature.
  • Model Integration: The DeepSeek v2 model's attention core is updated to pass the split q_nope and q_pe tensors to the attention backend when using the cutlass_mla backend.

Changelog

  • python/sglang/srt/layers/attention/cutlass_mla_backend.py
    • Modified forward_decode to accept optional q_rope and k_rope parameters (lines 237-238).
    • Added logic to call set_mla_kv_buffer when k_rope is provided (lines 245-251).
    • Reshaped the input q tensor and split it into q_nope and q_rope based on whether q_rope was provided (lines 261-269).
    • Updated the call to cutlass_mla_decode to pass q_nope and q_pe separately (lines 277-278).
  • python/sglang/srt/models/deepseek_v2.py
    • Added cutlass_mla to the list of attention backends that call self.attn_mqa with q_rope and k_rope in forward_absorb_core (lines 1013-1016).
  • sgl-kernel/csrc/attention/cutlass_mla_kernel.cu
    • Commented out the CUDA version check block (lines 31-46, 268-269).
    • Modified the args_from_options function to accept separate q_nope and q_pe tensors (lines 97-98).
    • Updated the calculation of strides for q_nope and q_pe based on their individual strides (lines 129-132).
    • Updated the args_from_options function to use separate pointers and strides for q_nope and q_pe (lines 143-152).
    • Modified the runMla function signature to accept separate q_nope and q_pe tensors (lines 181-182).
    • Updated the call to args_from_options within runMla to pass separate q_nope and q_pe (line 191).
    • Modified the cutlass_mla_decode function signature to accept separate q_nope and q_pe tensors (lines 213-214).
    • Updated device guarding and stream retrieval to use the q_nope tensor (lines 220-222).
    • Updated calls to runMla within cutlass_mla_decode to pass separate q_nope and q_pe (lines 232, 235, 238).
  • sgl-kernel/csrc/common_extension.cc
    • Updated the PyTorch C++ extension definition for cutlass_mla_decode to reflect the new signature with separate q_nope and q_pe tensors (line 62).
  • sgl-kernel/include/sgl_kernel_ops.h
    • Updated the C++ header declaration for cutlass_mla_decode to match the new signature with separate q_nope and q_pe tensors (lines 108-109).
  • sgl-kernel/python/sgl_kernel/attention.py
    • Modified the cutlass_mla_decode function signature to accept separate q_nope and q_pe tensors (lines 55-56).
    • Updated assertions to check dimensions and shapes of the separate q_nope and q_pe tensors (lines 63-79).
    • Modified padding logic to pad q_nope and q_pe separately if the number of heads is less than MAX_HEADS (lines 84-90); see the sketch after this changelog.
    • Updated dtype assertions to check q_nope and ensure all relevant inputs have the same dtype (lines 99-103).
    • Updated the output tensor creation to use q_nope as a reference (line 111).
    • Updated the call to the C++ extension torch.ops.sgl_kernel.cutlass_mla_decode.default to pass q_nope and q_pe separately (lines 115-116).
  • sgl-kernel/tests/test_cutlass_mla.py
    • Added logic to split the input q tensor into q_nope and q_pe before calling cutlass_mla_decode (lines 89-91).
    • Updated the call to cutlass_mla_decode to pass the newly created q_nope and q_pe tensors (line 96).
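As a rough illustration of the padding change mentioned in the sgl_kernel/attention.py entry above, the sketch below pads q_nope and q_pe independently along the head dimension. This is a hedged sketch, not the code in sgl-kernel/python/sgl_kernel/attention.py; the MAX_HEADS value and the [batch, heads, dim] layout are assumptions.

import torch

MAX_HEADS = 128  # assumed padding target; the real constant lives in sgl_kernel/attention.py

def pad_query_components(q_nope: torch.Tensor, q_pe: torch.Tensor):
    # Pad q_nope and q_pe separately along the head dimension; previously a
    # single concatenated tensor would have been padded once.
    bs, heads, _ = q_nope.shape
    assert q_pe.shape[:2] == (bs, heads), "q_nope and q_pe must share batch/head dims"
    if heads < MAX_HEADS:
        q_nope_padded = q_nope.new_zeros((bs, MAX_HEADS, q_nope.shape[-1]))
        q_nope_padded[:, :heads] = q_nope
        q_pe_padded = q_pe.new_zeros((bs, MAX_HEADS, q_pe.shape[-1]))
        q_pe_padded[:, :heads] = q_pe
        return q_nope_padded, q_pe_padded
    return q_nope, q_pe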

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request aims to improve performance by adapting the cutlass_mla_decode kernel and its callers to use split q_nope and q_pe tensors, thereby avoiding a potentially slow concatenation operation. The changes are consistently applied across the Python and CUDA C++ codebase.

However, there are a couple of important points to address, particularly regarding the CUDA version check and a potentially temporary include. The PR description also indicates this is a work-in-progress, so these comments are to help refine it before finalization.

Summary of Findings

  • CUDA Version Check Disabled: The preprocessor directive checking for CUDA version >= 12.4 in cutlass_mla_kernel.cu has been commented out. This is a critical issue as it might lead to build or runtime failures on systems not meeting this requirement if the kernel depends on features from CUDA 12.4+.
  • Inclusion of <iostream>: The <iostream> header was added to cutlass_mla_kernel.cu. This is often used for debugging and should typically be removed from production kernel code.
  • Variable Naming Consistency (q_rope vs q_pe): In python/sglang/srt/layers/attention/cutlass_mla_backend.py, the variable q_rope is used and passed as q_pe to the kernel. While functionally correct if they represent the same data, consistently using q_pe on the Python side where it maps to the kernel's q_pe might improve clarity. (Severity: low, not commented due to review settings)

Merge Readiness

This pull request makes significant progress in refactoring the attention mechanism to avoid a concatenation step, which is a positive step for performance. However, due to the critical issue of the disabled CUDA version check and the PR's work-in-progress nature (as indicated by the description and checklist), it is not ready for merging.

I recommend addressing the critical and medium severity comments. Specifically, the CUDA version check needs to be clarified and likely reinstated. Once these issues are resolved and the planned splitting into separate PRs and removal of temporary code are complete, it can be re-evaluated. I am unable to approve pull requests, so further review and approval by other maintainers will be necessary after the changes are made.

Comment on lines 31 to 46
// #if !defined(CUDA_VERSION) || CUDA_VERSION < 12040
// #error "dont be here"
// void cutlass_mla_decode(
// torch::Tensor const& out,
// torch::Tensor const& q_nope,
// torch::Tensor const& q_pe,
// torch::Tensor const& kv_c_and_k_pe_cache,
// torch::Tensor const& seq_lens,
// torch::Tensor const& page_table,
// torch::Tensor const& workspace) {
// TORCH_CHECK(false, "CUDA version must be >= 12.4 for cutlass_mla_decode");
// }
// int64_t cutlass_mla_get_workspace_size(int64_t max_seq_len, int64_t num_batches, int64_t sm_count, int64_t num_kv_splits) {
// TORCH_CHECK(false, "CUDA version must be >= 12.4 for cutlass_mla_get_workspace_size");
// }
// #else
@gemini-code-assist bot (Contributor) commented:

critical

The CUDA version check (#if !defined(CUDA_VERSION) || CUDA_VERSION < 12040) and the corresponding error/fallback functions have been commented out. If the cutlass_mla_decode kernel indeed has a minimum CUDA version requirement (e.g., 12.4 as indicated), removing this check could lead to compilation errors or runtime failures on systems with older CUDA versions.

Could you clarify the reason for commenting this out? Is this a temporary change for development, or is the kernel now compatible with older CUDA versions? If it's temporary, it's crucial to restore this check or an equivalent mechanism before merging to ensure compatibility and prevent unexpected issues for users.

@@ -22,25 +22,28 @@ limitations under the License.
#include <torch/all.h>

#include <cute/tensor.hpp>
#include <iostream>
@gemini-code-assist bot (Contributor) commented:

medium

The <iostream> header has been included. Is this for debugging purposes (e.g., std::cout)? If so, it's generally recommended to remove such debugging headers and statements from production code to keep the build clean and avoid potential side effects or performance impacts.

@fzyzcjy changed the title from "Remove 200us slow concat kernel" to "Remove 200us slow concat kernel (part 2: srt)" on Jun 13, 2025
@zhyncs merged commit c49c1d9 into sgl-project:main on Jun 13, 2025
30 of 50 checks passed