
Conversation

fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented Apr 18, 2025

Description

This branch merges various other branches and PRs, including mine, @ch-wan's, and others'. It is not meant to be merged (please merge the individual PRs instead); rather, it exists so that people can try these features together. In my tests it works well and is fast.

Below (folded) are some pretty early experiments:

Experiment 1: PD + EPLB + TBO (two batch overlap)

MOONCAKE_CONFIG_PATH=./collab_pd_node8.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode prefill --host 0.0.0.0 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.8:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap

MOONCAKE_CONFIG_PATH=./collab_pd_node9.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode prefill --host 0.0.0.0 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.8:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap

MOONCAKE_CONFIG_PATH=./collab_pd_node10.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode decode --host 0.0.0.0 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap

MOONCAKE_CONFIG_PATH=./collab_pd_node11.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode decode --host 0.0.0.0 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap

python3 -m sglang.srt.disaggregation.mini_lb --prefill http://10.10.38.8:30000 --decode http://10.10.38.10:30001 --host 0.0.0.0 --port 7000

(cd /host_home/primary_synced/sglang && while true; do python3 benchmark/gsm8k/bench_sglang.py --port 7000 --parallel 1400 --num-questions 1400; done)

GSM8K accuracy over repeated runs:

Accuracy: 0.943
Accuracy: 0.940
Accuracy: 0.948

Experiment 2: baseline vs baseline+EPLB vs baseline+EPLB+TBO

SGL_ENABLE_JIT_DEEPGEMM=1 python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --dist-init-addr 10.10.37.16:15000 --nnodes 4 --node-rank 0

SGL_ENABLE_JIT_DEEPGEMM=1 python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

SGL_ENABLE_JIT_DEEPGEMM=1 python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap --dist-init-addr 10.10.38.8:15000 --nnodes 4 --node-rank 0

while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 1024 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 2048 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 4096 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done
Total token throughput (tok/s):

|         | baseline   | baseline+EPLB | baseline+EPLB+TBO | TBO gain |
|---------|------------|---------------|-------------------|----------|
| in=1024 | 37000      | 44000         | 50000             | 14%      |
| in=2048 | not tested | 42500         | 49500             | 16%      |
| in=4096 | not tested | 39000         | 45000             | 15%      |
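Presumably the "TBO gain" column compares baseline+EPLB+TBO against baseline+EPLB; a quick check of that reading (plain arithmetic on the table values above, not code from this branch):

```python
# Assumed reading of "TBO gain": (baseline+EPLB+TBO) / (baseline+EPLB) - 1,
# rounded to whole percent.
rows = {
    "in=1024": (44000, 50000),
    "in=2048": (42500, 49500),
    "in=4096": (39000, 45000),
}
for name, (eplb, eplb_tbo) in rows.items():
    print(f"{name}: {(eplb_tbo / eplb - 1) * 100:.0f}%")  # 14%, 16%, 15%
```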

Remarks

  • This is prefill-only on 4x8xH100, which mimics the "P" in PD.
  • The 4x8xH100 EPLB experiments are documented in the EPLB PR's comments.
  • EPLB's performance will vary with the rate at which the distribution changes.
  • The numbers are rough: they vary across reruns, so I manually wrote down approximate means. Raw numbers are shown below.
Total token throughput (tok/s):          35776.72  
Total token throughput (tok/s):          28966.29  
Total token throughput (tok/s):          13715.11  
Total token throughput (tok/s):          37119.80  
Total token throughput (tok/s):          35848.38  
Total token throughput (tok/s):          36537.53  
Total token throughput (tok/s):          37410.02  
Total token throughput (tok/s):          37067.82  
Total token throughput (tok/s):          37331.73  
Total token throughput (tok/s):          37090.15  
Total token throughput (tok/s):          29850.90  
Total token throughput (tok/s):          36794.38 

Total token throughput (tok/s):          27436.31  
Total token throughput (tok/s):          44639.40  
Total token throughput (tok/s):          43791.02  
Total token throughput (tok/s):          28606.29  
Total token throughput (tok/s):          44339.44  
Total token throughput (tok/s):          44828.07  
Total token throughput (tok/s):          44545.44  
Total token throughput (tok/s):          42944.52  
Total token throughput (tok/s):          44245.43  
Total token throughput (tok/s):          44233.00  
Total token throughput (tok/s):          44480.95  
Total token throughput (tok/s):          44190.07  
Total token throughput (tok/s):          44571.99  
Total token throughput (tok/s):          45070.98  
Total token throughput (tok/s):          34750.29  
Total token throughput (tok/s):          44381.24  

Total token throughput (tok/s):          43972.36  
Total token throughput (tok/s):          17807.38  
Total token throughput (tok/s):          50539.15  
Total token throughput (tok/s):          50739.94  
Total token throughput (tok/s):          48904.76  
Total token throughput (tok/s):          51171.16  
Total token throughput (tok/s):          49084.35  
Total token throughput (tok/s):          11046.60  
Total token throughput (tok/s):          40685.67  
Total token throughput (tok/s):          49847.51  
Total token throughput (tok/s):          49538.92  
Total token throughput (tok/s):          52816.83  
Total token throughput (tok/s):          25901.15  
Total token throughput (tok/s):          9938.04   
Total token throughput (tok/s):          49635.06  
Total token throughput (tok/s):          50773.30  
Total token throughput (tok/s):          49626.10  
Total token throughput (tok/s):          27523.47  
Total token throughput (tok/s):          39458.44  
Total token throughput (tok/s):          48908.26  
Total token throughput (tok/s):          48692.34  
Total token throughput (tok/s):          52645.97  
Total token throughput (tok/s):          50049.03  
Total token throughput (tok/s):          49214.90 

Total token throughput (tok/s):          42020.11  
Total token throughput (tok/s):          42262.44  
Total token throughput (tok/s):          42545.37  
Total token throughput (tok/s):          42415.52  
Total token throughput (tok/s):          42415.27  
Total token throughput (tok/s):          42591.45  
Total token throughput (tok/s):          42522.59  
Total token throughput (tok/s):          42459.63  
Total token throughput (tok/s):          42395.97 

Total token throughput (tok/s):          49041.05  
Total token throughput (tok/s):          49752.24  
Total token throughput (tok/s):          50040.13  
Total token throughput (tok/s):          49734.44  
Total token throughput (tok/s):          49748.59 

Total token throughput (tok/s):          38556.29  
Total token throughput (tok/s):          38715.00  
Total token throughput (tok/s):          38810.12  
Total token throughput (tok/s):          38983.40  
Total token throughput (tok/s):          38722.26

Total token throughput (tok/s):          45227.81  
Total token throughput (tok/s):          45239.70 

2025.04.25 Update

I forgot to paste the latest results, which were obtained earlier, so here are some. You can reproduce them using this branch.

Case 1: Direct decode

  • Settings: batch size 256, input 1900, output 200 (so that the average KV cache length is exactly 2000; chosen because of the memory limit), 3P+9D
  • Numbers: about 2800 tok/s per GPU

Case 2: Simulated MTP decode

  • Settings: same as above, but with 80 µs of dummy kernels added per layer after the attention kernels, to match the extra attention overhead of MTP
  • Raw numbers: about 2700 tok/s
  • Calculation: 2700 × 0.8 (assuming a 60% MTP accept rate) = effectively 2160 tok/s (see the arithmetic sketch after Case 3)

Case 3: Prefill

  • Settings: 16384 tokens per batch per GPU, 4096 prompt length
  • Numbers: about 2.6 s per forward pass, i.e. ~6300 tok/s per GPU
  • With perfect EPLB: about 2.2 s per forward pass, i.e. ~7450 tok/s per GPU
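A quick arithmetic sketch tying these numbers together (the 0.8 factor and the 16384-token batch come from the settings above; this is plain arithmetic, not code from this branch):

```python
# Case 2: effective MTP decode throughput using the 0.8 factor quoted above
effective_mtp_tok_s = 2700 * 0.8  # = 2160 tok/s

# Case 3: prefill throughput per GPU = tokens per batch per GPU / forward time
tokens_per_batch_per_gpu = 16384
print(tokens_per_batch_per_gpu / 2.6)  # ~6300 tok/s per GPU (measured)
print(tokens_per_batch_per_gpu / 2.2)  # ~7450 tok/s per GPU (perfect EPLB)
print(effective_mtp_tok_s)             # 2160.0
```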

@fzyzcjy fzyzcjy closed this Apr 18, 2025
@fzyzcjy fzyzcjy changed the title [DO NOT MERGE] One branch that contains EPLB + Two Batch Overlap + dependencies One branch that contains EPLB + Two Batch Overlap + dependencies Apr 21, 2025
@fzyzcjy fzyzcjy reopened this Apr 21, 2025
@fzyzcjy fzyzcjy marked this pull request as ready for review April 21, 2025 13:54
@fzyzcjy fzyzcjy force-pushed the feat/dev_branch branch 2 times, most recently from 51633b5 to b69c117 on April 24, 2025 23:35
@merrymercy merrymercy mentioned this pull request Apr 25, 2025
@fzyzcjy fzyzcjy mentioned this pull request Apr 29, 2025
@fzyzcjy fzyzcjy requested a review from ch-wan as a code owner April 30, 2025 04:05
@Zars19

Zars19 commented May 13, 2025

I tested this PR with DeepEP + EPLB and found that each rank only tracks the expert load on its local GPU, with no cross-rank communication/summation happening at all. The saved expert distribution JSON file shows logical counts for a layer like:
[162554.0, 147879.0, 102827.0, 108924.0, 46123.0, 505540.0, 98864.0, 184719.0, 322237.0, 128708.0, 117722.0, 118096.0, 73528.0, 245469.0, 119943.0, 69937.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
(only experts on one GPU have non-zero values, and all remaining entries are zero).

This suggests the load balancing logic is not properly aggregating expert usage across all ranks. Could you please clarify or fix this behavior?
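For illustration, a minimal sketch of the kind of cross-rank aggregation I would expect (assuming torch.distributed is already initialized and `local_logical_count` holds this rank's per-expert counts; the actual code paths in this PR may differ):

```python
import torch
import torch.distributed as dist

def aggregate_logical_count(local_logical_count: torch.Tensor) -> torch.Tensor:
    # Each rank only observes hits for the experts it hosts, so the dumped
    # JSON should be the element-wise SUM over all EP ranks rather than a
    # single rank's counter.
    global_count = local_logical_count.clone()
    dist.all_reduce(global_count, op=dist.ReduceOp.SUM)
    return global_count
```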

@yansiyu550

May I ask what the commands are to reproduce Case 1 with 3P+9D?

> Case 1: Direct decode
> Settings: batch size 256, input 1900, output 200 (s.t. the average kv cache is exactly 2000; this is chosen because memory limit), 3P+9D
> Numbers: About 2800 tok/s per gpu

@Zars19

Zars19 commented May 15, 2025

I found an issue in both the deepseek_ep branch and this branch: when using DeepEP without DP, the logical_count statistics only capture results from a single device (non-zero values for only num_experts/ep_size experts). It works correctly when dp_size == ep_size. The root cause seems to be that _Communicator.fan_out is initialized with dp_size in:

self.expert_distribution_communicator = _Communicator(
    self.send_to_scheduler, server_args.dp_size
)

Could you please double-check whether dp_size is the correct parameter here for DeepEP no-DP scenarios?
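Purely to illustrate the question, a hypothetical sketch of what I would expect instead (the condition and the tp_size attribute are my assumptions, not code from this branch; the correct value may well be something else):

```python
# Hypothetical: fan out to every scheduler that reports a partial
# logical_count. In a DeepEP-without-DP run that is presumably tp_size
# schedulers, so using dp_size (= 1) collects only one rank's statistics.
fan_out = (
    server_args.dp_size
    if server_args.enable_dp_attention
    else server_args.tp_size
)
self.expert_distribution_communicator = _Communicator(
    self.send_to_scheduler, fan_out
)
```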

@fzyzcjy
Collaborator Author

fzyzcjy commented May 15, 2025

Hi, could you please discuss this in an issue instead? This PR contains many commits, so comments here easily get hidden.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 24, 2025

Closing this since everything has been merged into master.

@fzyzcjy fzyzcjy closed this May 24, 2025