Optimize Triton decoding kernel for long context #2394

ispobock · 2024-12-08T08:51:55Z

Motivation

As mentioned in #2271, the original Triton decoding kernel degrades significantly on long contexts. We refactored the kernel and adapted the flash decoding implementation from lightllm. With this change, the throughput decay on long contexts is substantially reduced (see the benchmark below).
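For readers unfamiliar with flash decoding, the split-KV idea can be sketched in plain Python for a single query and a single head. This is only an illustrative sketch, not the Triton kernel from this PR: the function names attend_chunk and split_kv_decode are hypothetical, and the real kernel runs the chunks in parallel on the GPU rather than in a Python loop.

```python
import math


def attend_chunk(q, keys, values):
    """Softmax attention over one KV chunk.

    Returns the chunk's normalized output and the log-sum-exp of its scores,
    which is all the second stage needs to merge chunks exactly.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def split_kv_decode(q, keys, values, num_kv_splits):
    """Two-stage decode attention: per-chunk partials, then a weighted merge."""
    n = len(keys)
    chunk = -(-n // num_kv_splits)  # ceiling division
    # Stage 1: each chunk is independent (parallel on a GPU, sequential here).
    partials = [
        attend_chunk(q, keys[i:i + chunk], values[i:i + chunk])
        for i in range(0, n, chunk)
    ]
    # Stage 2: reweight each partial output by exp(lse_i - max_lse).
    m = max(lse for _, lse in partials)
    weights = [math.exp(lse - m) for _, lse in partials]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * out[d] for w, (out, _) in zip(weights, partials)) / total
        for d in range(dim)
    ]
```

Because each chunk carries its own log-sum-exp, the merged result matches attention computed over the whole KV sequence at once, up to floating-point rounding.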

Benchmark

Tested with input length 128 and output length 2048.

Triton (this PR) num_kv_splits=8: 150->138

$ python3 -m sglang.bench_offline_throughput --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompt 1 --random-input 128 --random-output 2048 --random-range 1 --attention-backend triton

[2024-12-08 08:37:24 TP0] Decode batch. #running-req: 1, #token: 194, token usage: 0.00, gen throughput (token/s): 150.66, #queue-req: 0

[2024-12-08 08:37:37 TP0] Decode batch. #running-req: 1, #token: 2154, token usage: 0.00, gen throughput (token/s): 138.86, #queue-req: 0

Increasing --triton-attention-num-kv-splits gives better performance on long contexts.

Triton (this PR) num_kv_splits=16: 150->144

$ python3 -m sglang.bench_offline_throughput --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompt 1 --random-input 128 --random-output 2048 --random-range 1 --attention-backend triton --triton-attention-num-kv-splits 16

[2024-12-08 08:40:28 TP0] Decode batch. #running-req: 1, #token: 194, token usage: 0.00, gen throughput (token/s): 150.18, #queue-req: 0

[2024-12-08 08:40:42 TP0] Decode batch. #running-req: 1, #token: 2154, token usage: 0.00, gen throughput (token/s): 144.00, #queue-req: 0

Triton (main branch): 147->126

$ python3 -m sglang.bench_offline_throughput --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompt 1 --random-input 128 --random-output 2048 --random-range 1 --attention-backend triton

[2024-12-08 08:35:01 TP0] Decode batch. #running-req: 1, #token: 194, token usage: 0.00, gen throughput (token/s): 147.93, #queue-req: 0

[2024-12-08 08:35:15 TP0] Decode batch. #running-req: 1, #token: 2154, token usage: 0.00, gen throughput (token/s): 126.67, #queue-req: 0

Flashinfer: 143->143

$ python3 -m sglang.bench_offline_throughput --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompt 1 --random-input 128 --random-output 2048 --random-range 1

[2024-12-08 08:43:15 TP0] Decode batch. #running-req: 1, #token: 194, token usage: 0.00, gen throughput (token/s): 143.84, #queue-req: 0

[2024-12-08 08:43:29 TP0] Decode batch. #running-req: 1, #token: 2154, token usage: 0.00, gen throughput (token/s): 143.24, #queue-req: 0
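To compare the four runs on one scale, the relative throughput decay between the first logged decode batch (#token: 194) and the last (#token: 2154) can be computed from the numbers in the logs above. This is a small helper sketch; decay_percent is a hypothetical name, and the inputs are copied from the log lines.

```python
def decay_percent(start, end):
    """Relative drop in throughput from start to end, in percent."""
    return (start - end) / start * 100.0


# (throughput at #token 194, throughput at #token 2154), in token/s
runs = {
    "Triton (this PR), num_kv_splits=8": (150.66, 138.86),
    "Triton (this PR), num_kv_splits=16": (150.18, 144.00),
    "Triton (main branch)": (147.93, 126.67),
    "Flashinfer": (143.84, 143.24),
}

for name, (start, end) in runs.items():
    print(f"{name}: {start:.2f} -> {end:.2f} token/s "
          f"({decay_percent(start, end):.1f}% decay)")
```

By this measure the main-branch Triton kernel loses roughly 14% of its throughput over the run, this PR loses roughly 8% with 8 splits and roughly 4% with 16 splits, and Flashinfer stays under 1%.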

merrymercy · 2024-12-08T08:55:07Z

python/sglang/srt/layers/attention/triton_ops/decode_attention.py

@@ -705,10 +650,10 @@ def decode_attention_fwd(
            o,
            req_to_token,
            b_req_idx,
-            b_start_loc,


Should b_start_loc also be removed from the function signature of decode_attention_fwd?

Sure. max_len_in_batch and triton_attention_reduce_in_fp32 may also need to be removed.

merrymercy · 2024-12-08T08:55:56Z

python/sglang/srt/layers/attention/triton_backend.py

+                    forward_batch.batch_size,
+                    self.num_head,
+                    self.num_kv_splits,
+                    self.v_head_dim + 1,


After this change, do we still need to reduce the CUDA graph max batch size for DeepSeek models?

Let me verify it.

WANG-GH · 2025-03-11T08:05:08Z

Hello, does issue #2271 still need further development? I saw this issue and would like to give it a try. However, I noticed that you are currently working on fixing it. Could you share the progress so far? Do you still need any help? @ispobock @merrymercy

ispobock added 5 commits December 7, 2024 16:18

flash decoding draft

c35838d

fix acc

e9e7267

add args

1223a55

fix acc bug

dd4baa4

update ref

35356f0

ispobock requested review from merrymercy, Ying1123, zhyncs, hnyls2002 and ByronHsu as code owners December 8, 2024 08:51

Merge branch 'main' into flash-decoding

d42c098

merrymercy reviewed Dec 8, 2024

merrymercy approved these changes Dec 8, 2024

Merge branch 'main' into flash-decoding

1be3d36

merrymercy merged commit 7dc66fc into sgl-project:main Dec 8, 2024
0 of 14 checks passed

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025

Optimize Triton decoding kernel for long context (sgl-project#2394)

aff4fc1
