@ispobock (Collaborator) commented Feb 20, 2025

Motivation

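Skip the custom mask for the prefix part of Triton attention to accelerate the target verify stage. In the GSM8K benchmark below, output throughput rises from 796.177 to 1421.813 token/s (about 1.79x) and accuracy is unchanged.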
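To illustrate the idea only, here is a minimal pure-Python sketch, not the Triton kernel from this PR. It assumes that every prefix token is visible to every draft token during target verify, so the custom mask only carries information for the draft-token part. The function name `verify_attention` and the `mask_reads` counter are hypothetical and exist only for this example.

```python
import math


def verify_attention(query, keys, values, prefix_len, mask_row, skip_prefix_mask):
    """Single-query softmax attention over prefix keys followed by draft keys.

    mask_row holds one visibility flag per key. When skip_prefix_mask is True,
    prefix keys are assumed visible and their mask entries are never read.
    Returns the attention output and the number of mask entries read.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = []
    mask_reads = 0
    for idx, key in enumerate(keys):
        if skip_prefix_mask and idx < prefix_len:
            visible = True  # assumption: the prefix is fully visible
        else:
            visible = mask_row[idx]
            mask_reads += 1
        dot = sum(q * k for q, k in zip(query, key))
        scores.append(scale * dot if visible else -math.inf)

    # Numerically stable softmax; masked keys get weight exp(-inf) == 0.0.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]
    return output, mask_reads


# Toy data: 3 prefix tokens followed by 2 draft tokens.
prefix_len = 3
query = [1.0, 0.5]
keys = [[0.2, 0.1], [0.4, -0.3], [0.0, 0.9], [0.7, 0.7], [-0.5, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [3.0, 3.0]]
# The prefix part is all True; the draft part encodes the tree structure.
mask_row = [True, True, True, True, False]

full_out, full_reads = verify_attention(
    query, keys, values, prefix_len, mask_row, skip_prefix_mask=False
)
skip_out, skip_reads = verify_attention(
    query, keys, values, prefix_len, mask_row, skip_prefix_mask=True
)
print(full_reads, skip_reads)
```

Under that assumption both paths give the same output, while the second one reads mask entries only for the draft tokens. With a long prefix and only a few draft tokens, most mask reads disappear, which is consistent with the speedup reported here.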

python3 -m sglang.launch_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --speculative-algo EAGLE \
  --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B \
  --speculative-num-steps 2 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 4 \
  --disable-radix \
  --attention-backend triton
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319

# main
Accuracy: 0.233
Invalid: 0.002
Latency: 187.807 s
Output throughput: 796.177 token/s

# this pr
Accuracy: 0.233
Invalid: 0.002
Latency: 105.266 s
Output throughput: 1421.813 token/s

@zhyncs zhyncs merged commit ddcf9fe into sgl-project:main Feb 20, 2025
16 of 19 checks passed