Conversation

ispobock (Collaborator)
Motivation

For the no-prefix and short-prefix cases, prefill is compute bound, so we should dispatch to MHA instead of MLA to reduce prefill computation.

Benchmark results on DeepSeek-Coder-V2-Lite-Instruct:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --host 127.0.0.1 --disable-radix
python3 -m sglang.bench_one_batch_server --model None --base-url http://0.0.0.0:30000 --batch-size 128 --input-len 1024 --output-len 1

Before (MLA prefill):
batch size: 128
latency: 2.59 s
output throughput: 49.51 token/s
(input + output) throughput: 50750.40 token/s

After (MHA prefill):
batch size: 128
latency: 2.28 s
output throughput: 56.12 token/s
(input + output) throughput: 57522.51 token/s
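From the numbers above, the size of the improvement can be checked directly (a quick sanity calculation using only the throughput and latency figures reported in this thread):

```python
# Throughput figures copied from the benchmark output above.
baseline_tps = 50750.40  # (input + output) token/s, MLA prefill
mha_tps = 57522.51       # (input + output) token/s, MHA prefill

speedup = mha_tps / baseline_tps
print(f"throughput speedup: {speedup:.3f}x")  # ~1.133x, i.e. about 13%

# Latency tells the same story.
baseline_lat, mha_lat = 2.59, 2.28
print(f"latency reduction: {(1 - mha_lat / baseline_lat) * 100:.1f}%")  # ~12.0%
```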

@@ -561,11 +561,6 @@ def __init__(
"SGLANG_ROCM_FUSED_DECODE_MLA", "false"
)

# TODO: Design a finer way to determine the threshold
self.chunked_prefix_cache_threshold = get_int_env_var(
Member

chunked_prefix_cache_threshold is necessary. You can test input lengths (isl) 128, 256, 512, 1024 with output lengths (osl) 128, 256, 512, 1024 and batch sizes (bs) 1, 2, 4, 8, 16, 24, 32 to verify.

Collaborator

I feel that when there is no prefix, it makes sense to turn on the MHA optimization.
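The dispatch idea being discussed can be sketched as follows. This is a minimal illustration, not SGLang's actual API: the function name, signature, and the threshold value are all hypothetical; only the heuristic itself (MHA when the cached prefix is absent or short, MLA weight absorption otherwise) comes from the thread.

```python
# Illustrative value only; in SGLang this kind of threshold is read from an
# environment variable (see the chunked_prefix_cache_threshold diff above).
CHUNKED_PREFIX_CACHE_THRESHOLD = 8192

def choose_prefill_attention(prefix_len: int, extend_len: int) -> str:
    """Pick an attention path for a prefill batch (hypothetical helper).

    prefix_len: tokens already cached from a prefix hit.
    extend_len: new tokens to prefill in this batch.
    """
    if prefix_len == 0:
        # No prefix: prefill is compute bound, plain MHA avoids the
        # extra matmuls of MLA weight absorption.
        return "MHA"
    if prefix_len < CHUNKED_PREFIX_CACHE_THRESHOLD and prefix_len < extend_len:
        # Short prefix relative to the new tokens: still compute bound.
        return "MHA"
    # Long cached prefix: attention over the cache is memory bound,
    # so the MLA absorbed form wins.
    return "MLA"

print(choose_prefill_attention(0, 1024))     # MHA
print(choose_prefill_attention(65536, 128))  # MLA
```

With `--disable-radix` (as in the benchmark command above) every request has an empty prefix, which is why that flag isolates the MHA prefill path.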

@ispobock ispobock closed this Apr 22, 2025
@ispobock ispobock deleted the attn-dispatch branch April 22, 2025 09:25