Optimize triton attention custom mask #3731

ispobock · 2025-02-20T16:00:43Z

Motivation

Skip the custom mask for the prefix part of Triton attention to accelerate the target verify stage.
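To illustrate the idea (this is not the Triton kernel from this PR), here is a minimal pure-Python sketch. It assumes that during target verify every prefix key is visible to a draft query, so the custom mask only needs to be consulted for the draft part. The function names `attend_full_mask` and `attend_skip_prefix_mask` and the toy inputs are hypothetical and exist only for this example.

```python
import math


def attend_full_mask(q, keys, values, mask_row):
    """Baseline: look up the mask for every key, prefix included."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [
        scale * sum(a * b for a, b in zip(q, k)) if visible else -math.inf
        for k, visible in zip(keys, mask_row)
    ]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def attend_skip_prefix_mask(q, keys, values, prefix_len, draft_mask_row):
    """Sketch: online softmax in two phases, masking only the draft part."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])

    def accumulate(j):
        nonlocal running_max, denom, acc
        score = scale * sum(a * b for a, b in zip(q, keys[j]))
        new_max = max(running_max, score)
        rescale = math.exp(running_max - new_max)  # 0.0 on the first key
        weight = math.exp(score - new_max)
        denom = denom * rescale + weight
        acc = [x * rescale + weight * v for x, v in zip(acc, values[j])]
        running_max = new_max

    # Phase 1: prefix keys, assumed always visible, so no mask lookup.
    for j in range(prefix_len):
        accumulate(j)
    # Phase 2: draft keys, where the custom (tree) mask is applied.
    for j, visible in enumerate(draft_mask_row):
        if visible:
            accumulate(prefix_len + j)
    return [x / denom for x in acc]


# Toy data: 3 prefix keys followed by 2 draft keys (hypothetical values).
q = [1.0, 0.5]
keys = [[0.2, 0.1], [0.4, -0.3], [0.0, 0.9], [1.0, 1.0], [-0.5, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [3.0, 3.0]]
prefix_len = 3
draft_mask_row = [True, False]

full = attend_full_mask(q, keys, values, [True] * prefix_len + draft_mask_row)
skipped = attend_skip_prefix_mask(q, keys, values, prefix_len, draft_mask_row)
```

Both paths produce the same attention output. The second one simply never reads the mask for the prefix keys, which is the part of the work this PR aims to avoid.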

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --attention-backend triton
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319

# main
Accuracy: 0.233
Invalid: 0.002
Latency: 187.807 s
Output throughput: 796.177 token/s

# this pr
Accuracy: 0.233
Invalid: 0.002
Latency: 105.266 s
Output throughput: 1421.813 token/s
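
For reference, the relative improvement can be derived from the numbers above with a small script. The figures are copied from the two runs, and the variable names are only illustrative.

```python
# Benchmark figures copied from the "main" and "this pr" runs above.
main = {"latency_s": 187.807, "throughput_tok_s": 796.177}
this_pr = {"latency_s": 105.266, "throughput_tok_s": 1421.813}

# Ratios greater than 1.0 mean this PR is faster than main.
latency_speedup = main["latency_s"] / this_pr["latency_s"]
throughput_gain = this_pr["throughput_tok_s"] / main["throughput_tok_s"]

print(f"latency speedup: {latency_speedup:.2f}x")
print(f"throughput gain: {throughput_gain:.2f}x")
```

Both ratios come out at roughly 1.8x, while accuracy is unchanged at 0.233.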

skip custom mask for prefix part

d12cf10

ispobock requested review from merrymercy, Ying1123 and zhyncs as code owners February 20, 2025 16:00

Merge branch 'main' into skip-custom-mask

fec6422

zhyncs merged commit ddcf9fe into sgl-project:main Feb 20, 2025
16 of 19 checks passed
