
Conversation

fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented Apr 18, 2025

Description

This branch merges various other branches and PRs, including mine, @ch-wan's, and others'. It is not meant to be merged (please merge the individual PRs instead); rather, it exists so that people can try these features together. In my tests it works well and is fast.

Below (folded) are some pretty early experiments:

Experiment 1: PD + EPLB + TBO (two batch overlap)

MOONCAKE_CONFIG_PATH=./collab_pd_node8.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode prefill --host 0.0.0.0 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.8:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap

MOONCAKE_CONFIG_PATH=./collab_pd_node9.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode prefill --host 0.0.0.0 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.8:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap

MOONCAKE_CONFIG_PATH=./collab_pd_node10.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode decode --host 0.0.0.0 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap

MOONCAKE_CONFIG_PATH=./collab_pd_node11.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode decode --host 0.0.0.0 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 16 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap

python3 -m sglang.srt.disaggregation.mini_lb --prefill http://10.10.38.8:30000 --decode http://10.10.38.10:30001 --host 0.0.0.0 --port 7000

(cd /host_home/primary_synced/sglang && while true; do python3 benchmark/gsm8k/bench_sglang.py --port 7000 --parallel 1400 --num-questions 1400; done)

GSM8K accuracy over repeated runs:

Accuracy: 0.943
Accuracy: 0.940
Accuracy: 0.948

Experiment 2: baseline vs baseline+EPLB vs baseline+EPLB+TBO

SGL_ENABLE_JIT_DEEPGEMM=1 python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --dist-init-addr 10.10.37.16:15000 --nnodes 4 --node-rank 0

SGL_ENABLE_JIT_DEEPGEMM=1 python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

SGL_ENABLE_JIT_DEEPGEMM=1 python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --enable-two-batch-overlap --dist-init-addr 10.10.38.8:15000 --nnodes 4 --node-rank 0

while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 1024 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 2048 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 4096 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done
Total token throughput (tok/s):

|         | baseline   | baseline+EPLB | baseline+EPLB+TBO | TBO gain |
|---------|------------|---------------|-------------------|----------|
| in=1024 | 37000      | 44000         | 50000             | 14%      |
| in=2048 | not tested | 42500         | 49500             | 16%      |
| in=4096 | not tested | 39000         | 45000             | 15%      |
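Presumably the "TBO gain" column compares baseline+EPLB+TBO against baseline+EPLB; a quick check of that reading (plain arithmetic on the table values above, not code from this branch):

```python
# Assumed reading of "TBO gain": (baseline+EPLB+TBO) / (baseline+EPLB) - 1,
# rounded to whole percent.
rows = {
    "in=1024": (44000, 50000),
    "in=2048": (42500, 49500),
    "in=4096": (39000, 45000),
}
for name, (eplb, eplb_tbo) in rows.items():
    print(f"{name}: {(eplb_tbo / eplb - 1) * 100:.0f}%")  # 14%, 16%, 15%
```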

Remarks

  • This is prefill-only on 4x8xH100, which mimics the "P" in PD.
  • The 4x8xH100 EPLB experiments are documented in the EPLB PR's comments.
  • EPLB's performance will vary with the rate at which the distribution changes.
  • The numbers are rough: they vary across reruns, so I manually wrote down approximate means. Raw numbers are shown below.
Total token throughput (tok/s):          35776.72  
Total token throughput (tok/s):          28966.29  
Total token throughput (tok/s):          13715.11  
Total token throughput (tok/s):          37119.80  
Total token throughput (tok/s):          35848.38  
Total token throughput (tok/s):          36537.53  
Total token throughput (tok/s):          37410.02  
Total token throughput (tok/s):          37067.82  
Total token throughput (tok/s):          37331.73  
Total token throughput (tok/s):          37090.15  
Total token throughput (tok/s):          29850.90  
Total token throughput (tok/s):          36794.38 

Total token throughput (tok/s):          27436.31  
Total token throughput (tok/s):          44639.40  
Total token throughput (tok/s):          43791.02  
Total token throughput (tok/s):          28606.29  
Total token throughput (tok/s):          44339.44  
Total token throughput (tok/s):          44828.07  
Total token throughput (tok/s):          44545.44  
Total token throughput (tok/s):          42944.52  
Total token throughput (tok/s):          44245.43  
Total token throughput (tok/s):          44233.00  
Total token throughput (tok/s):          44480.95  
Total token throughput (tok/s):          44190.07  
Total token throughput (tok/s):          44571.99  
Total token throughput (tok/s):          45070.98  
Total token throughput (tok/s):          34750.29  
Total token throughput (tok/s):          44381.24  

Total token throughput (tok/s):          43972.36  
Total token throughput (tok/s):          17807.38  
Total token throughput (tok/s):          50539.15  
Total token throughput (tok/s):          50739.94  
Total token throughput (tok/s):          48904.76  
Total token throughput (tok/s):          51171.16  
Total token throughput (tok/s):          49084.35  
Total token throughput (tok/s):          11046.60  
Total token throughput (tok/s):          40685.67  
Total token throughput (tok/s):          49847.51  
Total token throughput (tok/s):          49538.92  
Total token throughput (tok/s):          52816.83  
Total token throughput (tok/s):          25901.15  
Total token throughput (tok/s):          9938.04   
Total token throughput (tok/s):          49635.06  
Total token throughput (tok/s):          50773.30  
Total token throughput (tok/s):          49626.10  
Total token throughput (tok/s):          27523.47  
Total token throughput (tok/s):          39458.44  
Total token throughput (tok/s):          48908.26  
Total token throughput (tok/s):          48692.34  
Total token throughput (tok/s):          52645.97  
Total token throughput (tok/s):          50049.03  
Total token throughput (tok/s):          49214.90 

Total token throughput (tok/s):          42020.11  
Total token throughput (tok/s):          42262.44  
Total token throughput (tok/s):          42545.37  
Total token throughput (tok/s):          42415.52  
Total token throughput (tok/s):          42415.27  
Total token throughput (tok/s):          42591.45  
Total token throughput (tok/s):          42522.59  
Total token throughput (tok/s):          42459.63  
Total token throughput (tok/s):          42395.97 

Total token throughput (tok/s):          49041.05  
Total token throughput (tok/s):          49752.24  
Total token throughput (tok/s):          50040.13  
Total token throughput (tok/s):          49734.44  
Total token throughput (tok/s):          49748.59 

Total token throughput (tok/s):          38556.29  
Total token throughput (tok/s):          38715.00  
Total token throughput (tok/s):          38810.12  
Total token throughput (tok/s):          38983.40  
Total token throughput (tok/s):          38722.26

Total token throughput (tok/s):          45227.81  
Total token throughput (tok/s):          45239.70 

2025.04.25 Update

I forgot to paste the latest results, which were obtained earlier, so here are some. You can reproduce them using this branch.

Case 1: Direct decode

  • Settings: batch size 256, input 1900, output 200 (so that the average KV cache length is exactly 2000; chosen because of the memory limit), 3P+9D
  • Numbers: about 2800 tok/s per GPU

Case 2: Simulated MTP decode

  • Settings: same as above, but with 80 µs of dummy kernels added per layer after the attention kernels, to match the extra attention overhead of MTP
  • Raw numbers: about 2700 tok/s
  • Calculation: 2700 × 0.8 (assuming a 60% MTP accept rate) = effectively 2160 tok/s (see the arithmetic sketch after Case 3)

Case 3: Prefill

  • Settings: 16384 tokens per batch per GPU, 4096 prompt length
  • Numbers: about 2.6 s per forward pass, i.e. ~6300 tok/s per GPU
  • With perfect EPLB: about 2.2 s per forward pass, i.e. ~7450 tok/s per GPU
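A quick arithmetic sketch tying these numbers together (the 0.8 factor and the 16384-token batch come from the settings above; this is plain arithmetic, not code from this branch):

```python
# Case 2: effective MTP decode throughput using the 0.8 factor quoted above
effective_mtp_tok_s = 2700 * 0.8  # = 2160 tok/s

# Case 3: prefill throughput per GPU = tokens per batch per GPU / forward time
tokens_per_batch_per_gpu = 16384
print(tokens_per_batch_per_gpu / 2.6)  # ~6300 tok/s per GPU (measured)
print(tokens_per_batch_per_gpu / 2.2)  # ~7450 tok/s per GPU (perfect EPLB)
print(effective_mtp_tok_s)             # 2160.0
```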

@fzyzcjy fzyzcjy closed this Apr 18, 2025
@fzyzcjy fzyzcjy changed the title [DO NOT MERGE] One branch that contains EPLB + Two Batch Overlap + dependencies One branch that contains EPLB + Two Batch Overlap + dependencies Apr 21, 2025
@fzyzcjy fzyzcjy reopened this Apr 21, 2025
@fzyzcjy fzyzcjy marked this pull request as ready for review April 21, 2025 13:54
@fzyzcjy fzyzcjy force-pushed the feat/dev_branch branch 2 times, most recently from 51633b5 to b69c117 on April 24, 2025 23:35
@merrymercy merrymercy mentioned this pull request Apr 25, 2025
@fzyzcjy fzyzcjy mentioned this pull request Apr 29, 2025
@fzyzcjy fzyzcjy requested a review from ch-wan as a code owner April 30, 2025 04:05
@Zars19

Zars19 commented May 13, 2025

I tested this PR with DeepEP + EPLB and found that each rank only tracks the expert load on its local GPU, with no cross-rank communication/summation happening at all. The saved expert distribution JSON file shows logical counts for a layer like:
[162554.0, 147879.0, 102827.0, 108924.0, 46123.0, 505540.0, 98864.0, 184719.0, 322237.0, 128708.0, 117722.0, 118096.0, 73528.0, 245469.0, 119943.0, 69937.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
(only experts on one GPU have non-zero values, and all remaining entries are zero).

This suggests the load balancing logic is not properly aggregating expert usage across all ranks. Could you please clarify or fix this behavior?
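For illustration, a minimal sketch of the kind of cross-rank aggregation I would expect (assuming torch.distributed is already initialized and `local_logical_count` holds this rank's per-expert counts; the actual code paths in this PR may differ):

```python
import torch
import torch.distributed as dist

def aggregate_logical_count(local_logical_count: torch.Tensor) -> torch.Tensor:
    # Each rank only observes hits for the experts it hosts, so the dumped
    # JSON should be the element-wise SUM over all EP ranks rather than a
    # single rank's counter.
    global_count = local_logical_count.clone()
    dist.all_reduce(global_count, op=dist.ReduceOp.SUM)
    return global_count
```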

@yansiyu550

May I ask what the commands are to reproduce Case 1 with 3P+9D?

> Case 1: Direct decode
> Settings: batch size 256, input 1900, output 200 (s.t. the average kv cache is exactly 2000; this is chosen because memory limit), 3P+9D
> Numbers: About 2800 tok/s per gpu

@Zars19

Zars19 commented May 15, 2025

I found an issue in both the deepseek_ep branch and this branch: when using DeepEP without DP, the logical_count statistics only capture results from a single device (non-zero values for only num_experts/ep_size experts). It works correctly when dp_size == ep_size. The root cause seems to be that _Communicator.fan_out is initialized with dp_size in:

self.expert_distribution_communicator = _Communicator(
    self.send_to_scheduler, server_args.dp_size
)

Could you please double-check whether dp_size is the correct parameter here for DeepEP no-DP scenarios?
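Purely to illustrate the question, a hypothetical sketch of what I would expect instead (the condition and the tp_size attribute are my assumptions, not code from this branch; the correct value may well be something else):

```python
# Hypothetical: fan out to every scheduler that reports a partial
# logical_count. In a DeepEP-without-DP run that is presumably tp_size
# schedulers, so using dp_size (= 1) collects only one rank's statistics.
fan_out = (
    server_args.dp_size
    if server_args.enable_dp_attention
    else server_args.tp_size
)
self.expert_distribution_communicator = _Communicator(
    self.send_to_scheduler, fan_out
)
```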

@fzyzcjy
Collaborator Author

fzyzcjy commented May 15, 2025

Hi, could you please discuss this in an issue instead? This PR contains many commits, so comments here easily get hidden.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 24, 2025

Closing this since everything has been merged into master.

@fzyzcjy fzyzcjy closed this May 24, 2025