Speed up when having padding tokens in DeepEP #6175

fzyzcjy · 2025-05-10T13:57:23Z

Motivation

Test commands:

PYTHONUNBUFFERED=1 SGLANG_TORCH_PROFILER_DIR=/host_home/temp_sglang_server2local python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-R1 --trust-remote-code --dist-init-addr 192.168.0.55:5757 --nnodes 2 --node-rank ${MY_NODE_RANK} --tp-size ${num_gpu} --dp-size ${num_gpu} --enable-dp-attention --mem-fraction-static 0.8 --chunked-prefill-size $((128*${num_gpu})) --max-running-requests $((${num_gpu}*128)) --context-length 4096 --disable-radix-cache --enable-deepep-moe --deepep-mode low_latency --cuda-graph-bs 128 --decode-log-interval 1

python3 -m sglang.bench_one_batch_server --model-path /dev/shm/DeepSeek-R1 --base-url http://localhost:30000 --batch-size 16 --input-len 1 --output-len 2048 --skip-warmup

baseline: 6 tok/s/gpu
PR: 29 tok/s/gpu
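
To put the two numbers in context, the short calculation below works out the reported speedup and a rough lower bound on how much of each decode batch is padding in this test. It assumes that each rank's decode batch is padded up to the captured CUDA graph batch size (`--cuda-graph-bs 128`) and that at most all 16 benchmark requests land on a single DP rank. Both are assumptions about the setup, not statements from the PR, and the variable names are illustrative.

```python
# Throughput figures reported above (tok/s/gpu).
baseline_tok_s_per_gpu = 6
pr_tok_s_per_gpu = 29

speedup = pr_tok_s_per_gpu / baseline_tok_s_per_gpu
print(f"speedup: {speedup:.2f}x")

# Values taken from the commands above.
cuda_graph_bs = 128   # --cuda-graph-bs 128
total_requests = 16   # --batch-size 16 in bench_one_batch_server

# Assumption: a rank's decode batch is padded to cuda_graph_bs.
# Even if all 16 requests ran on one rank, most rows would be padding.
min_padding_fraction = 1 - total_requests / cuda_graph_bs
print(f"padding fraction per rank: at least {min_padding_fraction:.1%}")
```

With requests spread over several DP ranks, the padding share per rank would be higher still, which is consistent with padding handling dominating the runtime in this benchmark.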

Modifications
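
The description does not spell out the change. As a rough illustration only, and not taken from this PR's diff, one way to avoid spending MoE work on padding tokens is to overwrite the top-k expert ids of padded rows with a sentinel that the dispatch step skips. The sketch assumes `-1` is treated as "no expert selected"; the function name `mask_padded_topk_ids` and the parameter `num_token_non_padded` are hypothetical names chosen for this example.

```python
def mask_padded_topk_ids(topk_ids, num_token_non_padded, skip_id=-1):
    """Return a copy of topk_ids where every row at index
    >= num_token_non_padded is replaced by skip_id entries.

    Assumption: the downstream dispatch ignores entries equal to skip_id.
    """
    return [
        list(row) if i < num_token_non_padded else [skip_id] * len(row)
        for i, row in enumerate(topk_ids)
    ]


# Four rows in a padded batch, of which only the first two are real tokens.
topk_ids = [[3, 7], [1, 4], [0, 0], [0, 0]]
masked = mask_padded_topk_ids(topk_ids, num_token_non_padded=2)
print(masked)
```

A real implementation would operate on GPU tensors rather than Python lists and would need to avoid launching extra kernels on the hot path, which is the concern raised later in this thread.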

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.


lambert0312 · 2025-05-20T10:06:03Z

This PR significantly reduces DeepSeek inference performance (by more than 15%). The specific cause needs to be investigated.

fzyzcjy · 2025-05-20T11:37:46Z

@lambert0312 That looks bad. Could you please share your commands? A profile would also be great. My first guess is that we need to fuse it.

lambert0312 · 2025-05-21T00:49:21Z

> @lambert0312 That looks bad. Could you please share your commands? A profile would also be great. My first guess is that we need to fuse it.

@fzyzcjy I tried to modify it. You can see the PR I linked above. Thank you.

fzyzcjy · 2025-05-21T01:05:03Z

Interesting, I thought this line already ensured that no extra kernels are executed.

fzyzcjy added 17 commits May 10, 2025 21:22

1208fb1 more
4529cc4 more
92330a2 more
ae70984 more
95440df more
5dc985f more
39204c6 more
8189122 more
2970122 more
d477d4f more
a137cee more
1d2f206 more
9ef32b4 fmt
ae6a10d more
a8f037d more
cec1bf5 more
eb97a26 more

fzyzcjy requested review from merrymercy, Ying1123, hnyls2002, zhyncs, ispobock, ByronHsu, HaiShaw and ch-wan as code owners May 10, 2025 13:57

51247b8 fmt

ch-wan reviewed May 10, 2025

python/sglang/srt/layers/moe/topk.py (resolved)

3fecc76 Update topk.py

ch-wan approved these changes May 11, 2025

zhyncs added the high priority label May 11, 2025

fzyzcjy marked this pull request as draft May 12, 2025 00:04

fzyzcjy force-pushed the feat/padding_moe branch from 8797942 to 3fecc76 Compare May 12, 2025 00:09

fzyzcjy marked this pull request as ready for review May 12, 2025 00:09

fzyzcjy and others added 9 commits May 12, 2025 08:09

c3fece0 Merge branch 'main' into feat/padding_moe
9414109 more
ed5c4b5 more
bd315ff Merge branch 'feat/padding_moe' of https://github.com/fzyzcjy/sglang … …into feat/padding_moe
d885df6 more
8e235e2 more
536b595 more
67d963d fmt
ab84bc7 more

zhyncs merged commit 2716830 into sgl-project:main May 17, 2025
113 of 128 checks passed

lambert0312 mentioned this pull request May 21, 2025

Fix topk inference performance reduce #6474

Merged

6 tasks

fzyzcjy mentioned this pull request May 27, 2025

Speed up when having padding tokens two-batch overlap #6668

Merged

6 tasks

Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025

Speed up when having padding tokens in DeepEP (sgl-project#6175)

26de0da

xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025

Speed up when having padding tokens in DeepEP (sgl-project#6175)

376abc1
