TRTLLM Gen MLA Decode Kernel Integration #7938

farazkh80 · 2025-07-10T20:45:22Z

Motivation

This PR integrates the TRTLLM-GEN MLA decode kernel from flashinfer into sglang.

Modifications

Introduced a new MLA backend option, TRTLLMMLABackend, in python/sglang/srt/layers/attention/trtllm_mla_backend.py.

Benchmarking

Low Concurrency Results TP=4 (4xB200)

Server Command: python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --trust-remote-code --attention-backend trtllm_mla/flashinfer/cutlass_mla --page-size 32/64/128 --tp-size 4 --max-running-requests 1 --cuda-graph-max-bs 1 --mem-fraction-static 0.90

Client Command: python -m sglang.bench_serving --backend sglang --host 0.0.0.0 --port 30000 --dataset-name random --random-input-len 1024 --random-output-len 8192 --num-prompts 1 --max-concurrency 1

Backend	Page Size	Requests	Max Concurrency	Achieved Concurrency	Output Throughput (tok/s)	Total Throughput (tok/s)
trtllm_mla	32	1	1	1.0	51.68	60.03
flashinfer	32	1	1	1.0	52.56	61.06
trtllm_mla	64	1	1	1.0	52.40	60.88
flashinfer	64	1	1	1.0	51.88	60.27
cutlass_mla	128	1	1	1.0	49.60	57.62
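The server command above uses slash-separated alternatives for the backend and page size. As a rough sketch (not part of the PR), the five configurations in the table can be expanded into concrete launch commands like this. The `CONFIGS` list and the `server_command` helper are hypothetical names introduced for illustration, and the double-dash spelling of `--mem-fraction-static` is assumed.

```python
import shlex

# (backend, page size) pairs taken from the low-concurrency table above.
CONFIGS = [
    ("trtllm_mla", 32),
    ("flashinfer", 32),
    ("trtllm_mla", 64),
    ("flashinfer", 64),
    ("cutlass_mla", 128),
]


def server_command(backend: str, page_size: int) -> list[str]:
    """Build one concrete launch command for a (backend, page size) pair."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", "deepseek-ai/DeepSeek-R1",
        "--trust-remote-code",
        "--attention-backend", backend,
        "--page-size", str(page_size),
        "--tp-size", "4",
        "--max-running-requests", "1",
        "--cuda-graph-max-bs", "1",
        "--mem-fraction-static", "0.90",
    ]


if __name__ == "__main__":
    # Print one shell-ready command line per benchmarked configuration.
    for backend, page_size in CONFIGS:
        print(shlex.join(server_command(backend, page_size)))
```

Each printed line can be pasted into a shell as-is; the client command stays the same for every configuration.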

Note: we don't observe a considerable perf gain at low concurrency because the kernel accounts for only about 7% of e2e latency (23 µs of kernel time out of 300 µs for one layer's forward pass). The trtllm_mla kernel itself is 40% faster than the flashinfer backend (17 µs at page size 32, which is 6 µs less than flashinfer MLA's 23 µs).
However, there is an extra q_rope and q_nope concatenation step before the call to trtllm_batch_decode_with_kv_cache_mla, plus an extra void flashinfer::zero_gmem_semaphore<int>(T1 *, int) kernel inside flashinfer. Together these two steps add about 5 µs, which nearly cancels out the 6 µs gained from the trtllm_batch_decode_with_kv_cache_mla kernel itself. (All of this can be seen in the kernel-wise comparison snapshot below.)
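The per-layer time budget described in this note can be worked through with a few lines of arithmetic. This is only a sketch using the rounded timings quoted above; the variable names are made up for illustration. With these rounded figures the kernel-only ratio comes out near 1.35x, and the net saving per layer is about 1 µs.

```python
# Per-layer timings quoted in the note above, in microseconds.
LAYER_FORWARD_US = 300.0      # one layer's forward pass
FLASHINFER_KERNEL_US = 23.0   # flashinfer MLA decode kernel
TRTLLM_KERNEL_US = 17.0       # trtllm_mla decode kernel, page size 32
EXTRA_STEPS_US = 5.0          # q_rope/q_nope concat + zero_gmem_semaphore

# Share of the layer's forward time spent in the attention kernel.
kernel_share = FLASHINFER_KERNEL_US / LAYER_FORWARD_US

# Saving and speedup when looking at the kernel in isolation.
kernel_saving_us = FLASHINFER_KERNEL_US - TRTLLM_KERNEL_US
kernel_speedup = FLASHINFER_KERNEL_US / TRTLLM_KERNEL_US

# What is left once the two extra steps are charged against the saving.
net_saving_us = kernel_saving_us - EXTRA_STEPS_US
net_layer_gain = net_saving_us / LAYER_FORWARD_US

print(f"kernel share of layer time: {kernel_share:.1%}")
print(f"kernel-only saving: {kernel_saving_us:.0f} us ({kernel_speedup:.2f}x)")
print(f"net saving per layer: {net_saving_us:.0f} us ({net_layer_gain:.2%})")
```

A net gain well under 1% of the layer time is consistent with the near-identical throughput numbers in the low-concurrency table.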

High Concurrency Results

Server Command: python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --trust-remote-code --attention-backend trtllm_mla/flashinfer --page-size 32/64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --mem-fraction-static 0.90

Client Command: python -m sglang.bench_serving --backend sglang --host 0.0.0.0 --port 30000 --dataset-name random --random-input-len 1024 --random-output-len 8192 --num-prompts 1024 --max-concurrency 512

Backend	Page Size	Requests	Max Concurrency	Achieved Concurrency	Output Throughput (tok/s)	Total Throughput (tok/s)
trtllm_mla	32	1024	512	393.25	4697.43	5273.36
flashinfer	32	1024	512	389.44	3311.30	3717.28
trtllm_mla	64	1024	512	390.85	4651.67	5221.99
flashinfer	64	1024	512	388.12	3355.26	3766.63
cutlass_mla	128	1024	512	389.99	3768.46	4230.50

Note: at high concurrency the kernel is the major bottleneck, so we observe the full 40% improvement in e2e perf compared to flashinfer MLA.
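The improvement can be checked directly against the output-throughput column of the high-concurrency table. The snippet below is a small sketch that compares trtllm_mla with flashinfer at matching page sizes; `OUTPUT_TOK_S` and `relative_gain` are illustrative names, not anything from the PR.

```python
# Output throughput (tok/s) from the high-concurrency table above.
OUTPUT_TOK_S = {
    ("trtllm_mla", 32): 4697.43,
    ("flashinfer", 32): 3311.30,
    ("trtllm_mla", 64): 4651.67,
    ("flashinfer", 64): 3355.26,
}


def relative_gain(new: float, base: float) -> float:
    """Fractional throughput improvement of `new` over `base`."""
    return new / base - 1.0


gains = {
    page_size: relative_gain(
        OUTPUT_TOK_S[("trtllm_mla", page_size)],
        OUTPUT_TOK_S[("flashinfer", page_size)],
    )
    for page_size in (32, 64)
}

for page_size, gain in gains.items():
    print(f"page size {page_size}: {gain:.1%} higher output throughput")
```

Both page sizes land close to the 40% figure quoted in the note.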

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
E2E DeepSeek R1 server launch and generation sanity
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

gemini-code-assist

Summary of Changes

Hello @farazkh80, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is an initial work-in-progress commit to integrate and enable TensorRT-LLM (TRTLLM) Multi-Head Latent Attention (MLA) kernels within the system. The primary objective is to enhance attention computation performance, particularly on Blackwell architectures, by utilizing these specialized kernels from FlashInfer. This lays the foundational groundwork for future performance improvements in decode operations.

Highlights

New Attention Backend: Introduced a new TRTLLMMLABackend to integrate TensorRT-LLM (TRTLLM) Multi-Head Latent Attention (MLA) kernels, leveraging the FlashInfer library for optimized decode operations.
Blackwell Optimization: The system now intelligently prioritizes the use of the new trtllm_mla attention backend on Blackwell (SM100) architectures, provided the specific FlashInfer kernels are available, aiming for improved performance.
Dependency Update: The flashinfer_python dependency in pyproject.toml has been updated to point to a specific Git commit, likely to access the necessary TRTLLM MLA features that are not yet in a released version.
Unit Testing: A new comprehensive test suite (test_trtllm_mla_backend.py) has been added to validate the functionality and correctness of the TRTLLM MLA backend across various configurations.
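The Blackwell prioritization described in the highlights can be pictured as a small decision function. This is a hypothetical sketch, not the logic in model_runner.py: the function name, its parameters, and the use of "flashinfer" as the fallback are assumptions made for illustration. SM100 corresponds to a CUDA compute capability with major version 10.

```python
def pick_mla_backend(
    sm_major: int,
    trtllm_mla_available: bool,
    default: str = "flashinfer",
) -> str:
    """Prefer trtllm_mla on SM100 when its kernels are present.

    sm_major: major compute capability of the GPU (10 for SM100).
    trtllm_mla_available: whether the required FlashInfer kernels import.
    default: backend to fall back to otherwise (assumed here).
    """
    if sm_major == 10 and trtllm_mla_available:
        return "trtllm_mla"
    return default


if __name__ == "__main__":
    print(pick_mla_backend(10, True))    # Blackwell with kernels present
    print(pick_mla_backend(10, False))   # Blackwell, kernels missing
    print(pick_mla_backend(9, True))     # non-Blackwell GPU
```

Keeping the decision in a pure function like this makes it easy to unit test without a GPU.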

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces support for TRT-LLM MLA kernels. My review has identified a few areas for improvement:

The project now depends on a personal fork of flashinfer, which should be addressed for better maintainability.
The new TRTLLMMLABackend has some hardcoded values for model dimensions and workspace size that could be made more flexible.
There's a critical bug in TRTLLMMLABackend where an assertion will fail for MLA use cases.
The backend selection logic in model_runner.py uses a bare except clause which should be more specific.
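On the last point, a common way to avoid a bare except when probing for an optional kernel is to catch only ImportError and check for the symbol explicitly. The helper below is a generic sketch, not the code from this PR; `has_symbol` is a made-up name, and the module path in the usage note is an assumption.

```python
import importlib


def has_symbol(module_name: str, attr_name: str) -> bool:
    """Return True if `module_name` imports and exposes `attr_name`.

    Only ImportError is caught, so unrelated failures (for example a
    KeyboardInterrupt or a bug inside the probing code) still surface
    instead of being silently swallowed by a bare `except:`.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr_name)


if __name__ == "__main__":
    print(has_symbol("math", "sqrt"))
    print(has_symbol("module_that_does_not_exist", "anything"))
```

For this PR the probe might look like `has_symbol("flashinfer.decode", "trtllm_batch_decode_with_kv_cache_mla")`, assuming that is where the installed flashinfer version exposes the function.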

python/sglang/srt/layers/attention/trtllm_mla_backend.py

python/pyproject.toml

python/sglang/srt/model_executor/model_runner.py

python/sglang/srt/layers/attention/trtllm_mla_backend.py

farazkh80 · 2025-07-15T03:10:54Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces the trtllm_mla attention backend, integrating TensorRT-LLM's Multi-Head Latent Attention kernels. The changes include the backend implementation, integration with the model runner, and a new test suite. Key areas for improvement include dependency management, KV cache preparation, and ensuring robustness for quantized models.

python/pyproject.toml

python/sglang/srt/layers/attention/trtllm_gen_mla_backend.py

python/sglang/srt/model_executor/model_runner.py

python/sglang/srt/layers/attention/utils.py

python/sglang/srt/layers/attention/trtllm_gen_mla_backend.py

farazkh80 · 2025-07-22T01:13:22Z

The PR should be ready for an initial review. The only pending change is waiting on flashinfer-ai/flashinfer#1289 to deduplicate the kv-cache. Duplication of the kv-cache is the main bottleneck for e2e perf, as seen in the nsys kernel-wise comparison capture below.

The left-hand side is flashinfer's BatchMLAPagedAttention, the kernel used by default for MLA, and the right-hand side is the new TRTLLM MLA kernel that this PR adds. This was captured at high concurrency (512) with TP=8 (8xB200).

farazkh80 · 2025-07-22T17:32:37Z

The kv-cache deduplication is now merged on the flashinfer side (flashinfer-ai/flashinfer#1289). I have reflected the changes in this PR, and at high concurrency we now have a 40% throughput improvement. This currently uses a bf16 kv-cache for MLA; a separate PR will follow to support fp8 kv-cache and query, which should allow us to further improve perf and concurrency.

Backend	Page Size	Requests	Max Concurrency	Achieved Concurrency	Output Throughput (tok/s)	Total Throughput (tok/s)
trtllm_mla	32	1024	512	393.25	4697.43	5273.36
flashinfer	32	1024	512	389.44	3311.30	3717.28

merrymercy

please fix the lint

python/pyproject.toml

merrymercy

If possible, can you also attach a torch profile? Just to check that the overlap scheduler works and that there isn't any cpu-gpu sync.

docs/backend/attention_backend.md

python/sglang/test/attention/test_trtllm_mla_backend.py

docs/backend/attention_backend.md

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

python/sglang/srt/model_executor/model_runner.py

pavanimajety

Please fix the file permissions of the files whose mode was changed to 755.
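Assuming the request is to revert executable bits that were set by accident, one way to do it is to clear the execute permission on the affected files. The snippet below is an illustrative sketch only; `clear_exec_bits` is a hypothetical helper and the demo runs on a temporary file rather than on the repository.

```python
import os
import stat
import tempfile

# Execute permission for user, group and others.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def clear_exec_bits(path: str) -> bool:
    """Drop all execute bits from `path`; return True if anything changed."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if (mode & EXEC_BITS) == 0:
        return False
    os.chmod(path, mode & ~EXEC_BITS)
    return True


if __name__ == "__main__":
    # Demonstrate on a throwaway file: 755 becomes 644.
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
        demo_path = tmp.name
    try:
        os.chmod(demo_path, 0o755)
        changed = clear_exec_bits(demo_path)
        new_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
        print(changed, oct(new_mode))
    finally:
        os.remove(demo_path)
```

In a git checkout the mode change then shows up as a `100755 → 100644` diff for each affected file.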

pavanimajety

LGTM, thanks!

gemini-code-assist bot reviewed Jul 10, 2025

farazkh80 changed the title ~~[WIP] [DLFW-5721] trtllm gen mla initial commit~~ [WIP] [DLFW-5721] trtllm gen mla integration Jul 10, 2025

farazkh80 changed the title ~~[WIP] [DLFW-5721] trtllm gen mla integration~~ [WIP] trtllm gen mla integration Jul 14, 2025

gemini-code-assist bot reviewed Jul 15, 2025

farazkh80 marked this pull request as ready for review July 15, 2025 03:13

farazkh80 requested review from merrymercy, Ying1123, hnyls2002, zhyncs, ispobock, HaiShaw, ch-wan, BBuf and ByronHsu as code owners July 15, 2025 03:13

farazkh80 changed the title ~~[WIP] trtllm gen mla integration~~ TRTLLM gen mla integration Jul 15, 2025

farazkh80 changed the title ~~TRTLLM gen mla integration~~ TRTLLM Gen MLA Decode Kernel Integration Jul 15, 2025

merrymercy mentioned this pull request Jul 21, 2025

Development Roadmap (2025 H2) #7736

Open

1 task

farazkh80 requested a review from zhaochenyang20 as a code owner July 22, 2025 01:41

merrymercy reviewed Jul 23, 2025

python/pyproject.toml (outdated)

merrymercy reviewed Jul 23, 2025

docs/backend/attention_backend.md

merrymercy reviewed Jul 23, 2025

python/sglang/test/attention/test_trtllm_mla_backend.py

yyihuang reviewed Jul 23, 2025

docs/backend/attention_backend.md

farazkh80 requested review from rkooo567 and kssteven418 as code owners July 24, 2025 19:33

farazkh80 force-pushed the fkhoubsirat-trtllm_gen_mla_sglang branch from fd9b07f to 8f8c478 Compare July 24, 2025 20:57

farazkh80 and others added 2 commits July 29, 2025 16:40

add todo comment

a39e817

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang

a52db2a

kushanam enabled auto-merge (squash) July 30, 2025 06:30

kushanam approved these changes Jul 30, 2025

Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang

348c22a

kushanam self-requested a review July 30, 2025 06:49

kushanam disabled auto-merge July 30, 2025 07:04

kushanam and others added 3 commits July 30, 2025 08:21

Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang

67b73a2

Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang

cd77760

Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang

f4746cd

farazkh80 mentioned this pull request Jul 30, 2025

[Bug] Decode OOM on DSR1 fp8 using flashinfer backend high concurrency 512 #8585

Closed

5 tasks

pavanimajety reviewed Jul 30, 2025

python/sglang/srt/model_executor/model_runner.py

pavanimajety suggested changes Jul 30, 2025

farazkh80 added 2 commits July 30, 2025 18:05

perm change

aa9764f

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang

b6654e4

pavanimajety approved these changes Jul 31, 2025

yyihuang approved these changes Jul 31, 2025

Merge branch 'main' into fkhoubsirat-trtllm_gen_mla_sglang

71bf4ae

zhyncs self-assigned this Jul 31, 2025

zhyncs added the high priority label Jul 31, 2025

farazkh80 mentioned this pull request Jul 31, 2025

TRTLLM Gen MLA Decode Kernel Integration (same as #7938) #8632

Merged

zhyncs closed this Jul 31, 2025

zhyncs pushed a commit that referenced this pull request Jul 31, 2025

TRTLLM Gen MLA Decode Kernel Integration (same as #7938) (#8632)

4b04998

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

huangzhilin-hzl pushed a commit to huangzhilin-hzl/sglang that referenced this pull request Aug 1, 2025

TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (s…

b2d5132

…gl-project#8632) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

TianQiLin666666 pushed a commit to TianQiLin666666/sglang that referenced this pull request Aug 1, 2025

TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (s…

9edd5c0

…gl-project#8632) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

lifuhuang pushed a commit that referenced this pull request Aug 3, 2025

TRTLLM Gen MLA Decode Kernel Integration (same as #7938) (#8632)

6019cba

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025

TRTLLM Gen MLA Decode Kernel Integration (same as #7938) (#8632)

08a83dc

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025

TRTLLM Gen MLA Decode Kernel Integration (same as #7938) (#8632)

4535379

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025

TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (s…

7acfe9d

…gl-project#8632) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025

TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (s…

b99a61b

…gl-project#8632) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
