Conversation

fzyzcjy (Collaborator) commented Apr 11, 2025

This PR contains all code from the PR chain (please merge that chain instead of this one), and I will put experiment results below.

What's next

  • Further reduce eplb_rebalance overhead: The logic is there, but there are many easy optimizations to further reduce the rebalance overhead and fix minor bugs (e.g. making the load-from-pin-memory mode share pinned memory across processes). Given that the EPLB feature is needed urgently, I will lower the priority of this optimization and do it later.
  • Further test eplb_rebalance: Given the priority, I will do that later.
  • There is a new moe_fused_gate kernel and it can be fused.

PR Chain

If you want to try PD + EPLB + two-batch-overlap + ..., here is the branch that merges everything before they are merged into master: https://github.com/fzyzcjy/sglang/tree/feat/dev_branch

These can be reviewed:

These are outdated and I will update later:

Low priority future work

  • Support overlap scheduler for dynamic mode (static mode already supports it): Since speculative decoding and PD disaggregation require disabling the overlap scheduler, and EPLB works best with them, making dynamic mode support the overlap scheduler does not seem to be a high-priority task.
  • Only enable the expert distribution recorder for a fraction of requests instead of all of them in online EPLB mode: Given that we want to rebalance very frequently, this does not seem necessary.
  • Maybe determine how to dispatch based on the gate output computed for each batch.

2025.04.11 Quick Experiment 1: 2x8xH100

Note: I have not done ANY profiling or other code tuning; these results are dumped directly from the most naive code.

# prepare: collect stat data using sharegpt
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 16 --dp 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --disable-overlap-schedule --decode-log-interval 1 --host 0.0.0.0 --port 20000 --enable-eplb --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.6:15000 --nnodes 2 --node-rank 0
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1000 --sharegpt-output-len 1 --max-concurrency 64 --port 20000
curl -X POST http://127.0.0.1:20000/eplb_save_expert_distribution
cp /tmp/eplb_storage/expert_distribution_storage/YOUR_NAME.json /host_home/temp_sglang_server2local/

# baseline
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 16 --dp 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --disable-overlap-schedule --decode-log-interval 1 --host 0.0.0.0 --port 20000 --dist-init-addr 10.10.38.8:15000 --nnodes 2 --node-rank 0

# PR
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 16 --dp 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --init-expert-location /host_home/temp_sglang_server2local/YOUR_NAME.json --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.6:15000 --nnodes 2 --node-rank 0

# tests
(cd /host_home/primary_synced/sglang && while true; do python3 benchmark/gsm8k/bench_sglang.py --port 20000 --parallel 1400 --num-questions 1400; done)
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1000 --sharegpt-output-len 4 --max-concurrency 64 --port 20000 ; python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1 --random-range-ratio 1.0 --max-concurrency 64 --port 20000 ; done

Explanations

  • Here I use the most manual workflow: one run to collect stats and another run to use them (see the sketch after this list). Directly calling rebalance on the fly is also implemented and has passed the tests.
  • Since I collect stats from the sharegpt dataset, I test on both sharegpt and random, to avoid overfitting to sharegpt's scenario.
  • Here I only test prefill, not decode (because I use dp attention, deepep-mode=auto is unavailable, so I need deepep-mode=normal, which cannot use CUDA graph, so decode is super slow).
    • The sharegpt dataset requires at least 4 new tokens, so it contains some decode. The random dataset is prefill-only.
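
For convenience, here is a minimal Python sketch of the save-and-copy step from the prepare block above; it only automates the curl and cp commands already shown, and assumes the dump lands under the expert_distribution_storage directory with an arbitrary filename:

import glob
import shutil

import requests

SERVER = "http://127.0.0.1:20000"

# Ask the running server to dump the recorded expert distribution
# (same endpoint as the curl call in the prepare block).
requests.post(f"{SERVER}/eplb_save_expert_distribution").raise_for_status()

# Copy the newest dump somewhere persistent so a later launch can consume it
# via --init-expert-location.
dumps = sorted(glob.glob("/tmp/eplb_storage/expert_distribution_storage/*.json"))
shutil.copy(dumps[-1], "/host_home/temp_sglang_server2local/")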

Outputs

baseline (the even ones are sharegpt, the odd ones are random; please ignore the first several runs, which are warmup)

Accuracy: 0.933
Accuracy: 0.930
Accuracy: 0.932
Accuracy: 0.934
Accuracy: 0.934
Accuracy: 0.930
Accuracy: 0.934
Accuracy: 0.934
Accuracy: 0.933
Accuracy: 0.934

Total token throughput (tok/s):          4201.48   
Total token throughput (tok/s):          16533.91  
Total token throughput (tok/s):          6899.32   
Total token throughput (tok/s):          17985.68  
Total token throughput (tok/s):          6898.84   
Total token throughput (tok/s):          17982.58  
Total token throughput (tok/s):          6919.91   
Total token throughput (tok/s):          18081.89  
Total token throughput (tok/s):          5069.11   
Total token throughput (tok/s):          18024.01  
Total token throughput (tok/s):          6899.14   
Total token throughput (tok/s):          17867.05  
Total token throughput (tok/s):          6116.35   
Total token throughput (tok/s):          18053.47  
Total token throughput (tok/s):          6926.80   
Total token throughput (tok/s):          18023.24  
Total token throughput (tok/s):          6866.18   
Total token throughput (tok/s):          18059.59  
Total token throughput (tok/s):          6925.82   
Total token throughput (tok/s):          18077.90  
Total token throughput (tok/s):          7089.84   
Total token throughput (tok/s):          18086.66  
Total token throughput (tok/s):          6943.99   
Total token throughput (tok/s):          18029.46  
Total token throughput (tok/s):          6831.31

PR

Accuracy: 0.927
Accuracy: 0.937
Accuracy: 0.936
Accuracy: 0.932
Accuracy: 0.936
Accuracy: 0.933
Accuracy: 0.936

Total token throughput (tok/s):          6505.96   
Total token throughput (tok/s):          19834.88  
Total token throughput (tok/s):          4747.47   
Total token throughput (tok/s):          19671.99  
Total token throughput (tok/s):          4301.55   
Total token throughput (tok/s):          19692.20  
Total token throughput (tok/s):          6472.16   
Total token throughput (tok/s):          19743.38  
Total token throughput (tok/s):          6438.49   
Total token throughput (tok/s):          19772.33  
Total token throughput (tok/s):          6337.45   
Total token throughput (tok/s):          18282.91  
Total token throughput (tok/s):          6340.48   
Total token throughput (tok/s):          19469.87  
Total token throughput (tok/s):          6683.97   
Total token throughput (tok/s):          19692.82  
Total token throughput (tok/s):          6501.55   
Total token throughput (tok/s):          19730.55  
Total token throughput (tok/s):          6512.66   
Total token throughput (tok/s):          19683.61  
Total token throughput (tok/s):          6550.42   
Total token throughput (tok/s):          19725.90  
Total token throughput (tok/s):          6507.50   
Total token throughput (tok/s):          19810.17  
Total token throughput (tok/s):          6531.86   
Total token throughput (tok/s):          19834.79  
Total token throughput (tok/s):          6488.62   
Total token throughput (tok/s):          19296.52  
Total token throughput (tok/s):          6128.42   

Conclusion

  • Collect stats from sharegpt, then test on random: 18000 -> 19500, so there is a speedup.
  • Collect stats from sharegpt, then test on sharegpt: 7000 -> 6500, which looks like a slowdown; but since this run contains decode and, as mentioned above, decode is out of scope here, I am not sure whether it is a real slowdown. I will do more decode experiments later.
  • Correctness on gsm8k: seems roughly equal.
  • EDIT: I realized the baseline wrongly used --disable-overlap-schedule, so the numbers are off; fixed in experiment 3.

2025.04.11 Quick Experiment 2: 4x8xH100

# baseline
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --disable-overlap-schedule --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 32768 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

# PR
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --chunked-prefill-size 32768 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

# bench
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1 --random-range-ratio 1.0 --max-concurrency 1000 --port 20000 ; done

Output

baseline output

Total token throughput (tok/s):          26736.89  
Total token throughput (tok/s):          32337.00  
Total token throughput (tok/s):          32368.31  
Total token throughput (tok/s):          32034.95  
Total token throughput (tok/s):          31059.59  
Total token throughput (tok/s):          32322.06  
Total token throughput (tok/s):          32821.19  
Total token throughput (tok/s):          32908.73  
Total token throughput (tok/s):          32756.34  
Total token throughput (tok/s):          32906.58  
Total token throughput (tok/s):          33375.62  
Total token throughput (tok/s):          33160.48  
Total token throughput (tok/s):          33233.37  
Total token throughput (tok/s):          33103.22  
Total token throughput (tok/s):          33150.36  
Total token throughput (tok/s):          33320.48  
Total token throughput (tok/s):          33138.29  
Total token throughput (tok/s):          33161.51  

PR output

Total token throughput (tok/s):          28799.74  
Total token throughput (tok/s):          36454.41  
Total token throughput (tok/s):          31820.63  
Total token throughput (tok/s):          34819.66  
Total token throughput (tok/s):          36648.10  
Total token throughput (tok/s):          36529.53  
Total token throughput (tok/s):          36300.26  
Total token throughput (tok/s):          37110.16  
Total token throughput (tok/s):          36968.22  
Total token throughput (tok/s):          36791.78  
Total token throughput (tok/s):          38198.47  
Total token throughput (tok/s):          37466.60  
Total token throughput (tok/s):          38136.77  
Total token throughput (tok/s):          38384.78  
Total token throughput (tok/s):          38322.69  
Total token throughput (tok/s):          37399.52  
Total token throughput (tok/s):          38184.65  
Total token throughput (tok/s):          37580.02  

Conclusion

  • Roughly 33000 -> 37500
  • EDIT: I realized the baseline wrongly used --disable-overlap-schedule, so the numbers are off; fixed in experiment 3.

2025.04.12 Quick Experiment 3: 4x8xH100

# baseline
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --dist-init-addr 10.10.37.16:15000 --nnodes 4 --node-rank 0

# PR
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

# bench
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 1024 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done

Output

baseline output

Total token throughput (tok/s):          27674.34  
Total token throughput (tok/s):          28727.64  
Total token throughput (tok/s):          30776.73  
Total token throughput (tok/s):          31028.49  
Total token throughput (tok/s):          20955.54  
Total token throughput (tok/s):          25092.32  
Total token throughput (tok/s):          30976.76  
Total token throughput (tok/s):          30892.42  
Total token throughput (tok/s):          30206.75  
Total token throughput (tok/s):          31271.44  
Total token throughput (tok/s):          31619.85  
Total token throughput (tok/s):          30824.69  
Total token throughput (tok/s):          30973.65  
Total token throughput (tok/s):          31600.21  
Total token throughput (tok/s):          30503.18  
Total token throughput (tok/s):          31693.68  
Total token throughput (tok/s):          24758.57  
Total token throughput (tok/s):          31781.42  
Total token throughput (tok/s):          31557.28  
Total token throughput (tok/s):          31542.53  

PR output

Total token throughput (tok/s):          31486.29  
Total token throughput (tok/s):          15746.00  
Total token throughput (tok/s):          20702.08  
Total token throughput (tok/s):          35706.74  
Total token throughput (tok/s):          34202.98  
Total token throughput (tok/s):          34990.21  
Total token throughput (tok/s):          35077.61  
Total token throughput (tok/s):          35654.64  
Total token throughput (tok/s):          35710.16  
Total token throughput (tok/s):          26289.05  
Total token throughput (tok/s):          37157.12  
Total token throughput (tok/s):          37312.28  
Total token throughput (tok/s):          37287.00  
Total token throughput (tok/s):          35910.75  
Total token throughput (tok/s):          35809.48  
Total token throughput (tok/s):          37008.45  
Total token throughput (tok/s):          35527.77  
Total token throughput (tok/s):          27380.26  
Total token throughput (tok/s):          37558.58  
Total token throughput (tok/s):          37124.57  
Total token throughput (tok/s):          37017.02  

Conclusion

  • Throughput 31500 -> 37000

Update

  • Prefill + sharegpt dataset: also observed a speedup, 30500 -> 34500.
  • Decode + random dataset (because of low-latency DeepGEMM restrictions, I have to make the prompt length very short): also observed the utilization rate change from 54.6% to 78.1%, which agrees with eplb_simulator.py's prediction of 54.7% -> 79.7%. As for speedup: since GEMM takes a tiny portion of the current profile, an end-to-end speedup cannot be observed yet, until the other parts are faster.
  • The speedup depends on the expert distribution and on distribution shift, so you can use eplb_simulator.py to simulate the theoretical speedup on your data and check whether the actual speedup agrees with it (a minimal sketch of the utilization metric follows this list).
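
As a rough illustration of the utilization-rate numbers above, here is a minimal sketch that assumes "utilization rate" means mean per-GPU expert load divided by the maximum per-GPU load; this is not the actual eplb_simulator.py code, just the intuition behind it:

from typing import List

def utilization_rate(tokens_per_physical_expert: List[int], num_gpus: int) -> float:
    """Estimate how evenly MoE work is spread across GPUs for one layer (assumed metric)."""
    experts_per_gpu = len(tokens_per_physical_expert) // num_gpus
    loads = [
        sum(tokens_per_physical_expert[g * experts_per_gpu:(g + 1) * experts_per_gpu])
        for g in range(num_gpus)
    ]
    # The most loaded GPU bounds the step time; perfect balance gives 1.0.
    return sum(loads) / len(loads) / max(loads)

# Example: 8 GPUs, 16 physical experts, one very hot expert on GPU 0.
print(utilization_rate([900, 100] + [100] * 14, num_gpus=8))  # ~0.3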

Update 2025.04.18

To confirm that the latest code still has correct results and good performance, here is an experiment with the latest code:

# baseline
python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 131076 --max-total-tokens 65536 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

# PR
python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 131076 --max-total-tokens 65536 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.8:15000 --nnodes 4 --node-rank 0

# correctness
(cd /host_home/primary_synced/sglang && while true; do python3 benchmark/gsm8k/bench_sglang.py --port 20000 --parallel 1400 --num-questions 1400; done)

# bench
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 1024 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done

Output

  • GSM8k: 93.2 vs 93.6
  • Speed: 30500 vs 35000

Update 2025.04.18: EPLB + PD correctness test

Use https://github.com/fzyzcjy/sglang/tree/feat/dev_branch branch (which contains PRs about PD + EPLB)

MOONCAKE_CONFIG_PATH=./collab_pd_node8.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode prefill --host 0.0.0.0 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.8:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32

MOONCAKE_CONFIG_PATH=./collab_pd_node9.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode prefill --host 0.0.0.0 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.8:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32

MOONCAKE_CONFIG_PATH=./collab_pd_node10.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode decode --host 0.0.0.0 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32

MOONCAKE_CONFIG_PATH=./collab_pd_node11.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode decode --host 0.0.0.0 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32

python3 -m sglang.srt.disaggregation.mini_lb --prefill http://10.10.38.8:30000 --decode http://10.10.38.10:30001 --host 0.0.0.0 --port 7000

(cd /host_home/primary_synced/sglang && while true; do python3 benchmark/gsm8k/bench_sglang.py --port 7000 --parallel 1400 --num-questions 1400; done)

Got a GSM8K score of 94.5.

fzyzcjy added 10 commits April 18, 2025 10:32
hihiztc1 commented May 12, 2025

Hi @fzyzcjy,

Thanks for your excellent work on the EPLB mechanism — it’s very inspiring!

I’m currently a student interested in EPLB. Recently, I came across your feat/dev_branch and wanted to study the implementation of EPLB in depth, especially regarding how expert distribution and rebalancing are handled.

However, when checking out the branch, I noticed that some components appear to be missing or incomplete. I tried to patch things myself, but couldn't fully reproduce the intended behavior.

May I ask:
Which commit ID or tag contains a complete and working version of the EPLB implementation?

Thanks again for sharing your work publicly. I’d really appreciate any guidance!
[Screenshot attached: EPLB_error]

fzyzcjy (Collaborator, Author) commented May 12, 2025

@hihiztc1 Hi, please check #6017, which contains full guidance.

fzyzcjy (Collaborator, Author) commented May 24, 2025

EPLB is already there.

@fzyzcjy fzyzcjy closed this May 24, 2025
nannaer commented Jun 23, 2025

Thank you very much for your contributions to EPLB! I would like to ask a question: does the current main branch support "changing locations of experts when the server is running"? If not, which branch should I switch to? I want to analyze the overhead of expert migration while the server is running. @fzyzcjy

fzyzcjy (Collaborator, Author) commented Jun 23, 2025

Support changing locations of experts when server is running

Sure, --enable-eplb

nannaer commented Jun 23, 2025

Support changing locations of experts when server is running

Sure, --enable-eplb

Thank you very much for your answer! I would also like to ask for a piece of advice: if I want to profile the overhead of expert migration during rebalancing, which files or classes do you suggest I study? (main branch) @fzyzcjy

fzyzcjy (Collaborator, Author) commented Jun 23, 2025

Start from EPLBManager; the logic is pretty easy to read.
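
For intuition only, here is a tiny sketch of the kind of computation a rebalance performs (a hypothetical helper, not the actual EPLBManager code): given the recorded per-logical-expert token counts, assign the redundant physical slots to the hottest experts so per-replica load evens out; the expensive part in practice is then physically moving the expert weights to match the new placement.

from typing import Dict, List

def greedy_replication(tokens_per_logical_expert: List[int], num_redundant: int) -> Dict[int, int]:
    """Return {logical_expert_id: replica_count}; total slots = num experts + num_redundant."""
    replicas = {e: 1 for e in range(len(tokens_per_logical_expert))}
    for _ in range(num_redundant):
        # Give the next redundant slot to the expert with the highest per-replica load.
        hottest = max(replicas, key=lambda e: tokens_per_logical_expert[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

# Example: 8 logical experts, 2 redundant slots; expert 0 is very hot.
print(greedy_replication([800, 100, 90, 80, 70, 60, 50, 40], num_redundant=2))  # expert 0 gets 3 replicas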

nannaer commented Jun 24, 2025

start from EPLBManager and the logic is pretty easy to read

Thanks!

tianhaoz95 commented Jul 8, 2025

Hi @fzyzcjy, I noticed that when a rebalance happens, all requests need to wait until the rebalance finishes. First, is my understanding correct? If so, I am wondering: since only redundant experts change during the rebalance, is it possible not to make requests wait for it?

fzyzcjy (Collaborator, Author) commented Jul 8, 2025

@tianhaoz95 Hi

since only redundant experts change during the rebalance

almost all (at least most) experts change indeed

nannaer commented Jul 8, 2025

@tianhaoz95 Hi

since only redundant experts change during the rebalance

almost all (at least most) experts change indeed

Hi, taking DeepSeek V3 as an example, what is the approximate rebalance delay? Do you have any test data?

@tianhaoz95

@nannaer For R1 on a single host with 8 GPUs I have:

--- Expert Placement Latency Measurement ---
Latency: 0.6830 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6732 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.5920 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6737 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.5941 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6907 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6790 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6887 seconds for updating 58 layers.

tianhaoz95 commented Jul 9, 2025

@tianhaoz95 Hi

since only redundant experts change during the rebalance

almost all (at least most) experts change indeed

@fzyzcjy Got it. Right, in my brief testing 156 out of 160 experts get replaced for a single layer. However, I was thinking that this 160 is still a combination of 128 distinct experts and 32 redundant experts, and only these 32 redundant experts are different. Is this correct?

tianhaoz95 commented Jul 9, 2025

@fzyzcjy Also, I'm trying to understand the --ep-dispatch-algorithm static option. It almost looks like it will ignore all the redundant experts and only map the gating logits to non-redundant experts; am I missing something?

fzyzcjy (Collaborator, Author) commented Jul 10, 2025

it almost looks like this will ignore all the redundant experts

I am a bit confused about this...

tianhaoz95 commented Jul 10, 2025

it almost looks like this will ignore all the redundant experts

I am a bit confused about this...

@fzyzcjy my bad, let me try with an example.

For example, suppose a model layer has 5 experts and I add 3 redundant experts.

Let's assume we now have 8 physical experts holding logical experts 0, 1, 2, 3, 4 plus 1, 1, 2 (expert 1 is duplicated twice and expert 2 once).

When I set --ep-dispatch-algorithm static and my gating logits map to experts [1, 2] (assuming topk is 2), it looks like topk_ids_logical_to_physical will always output [1, 2], and the 2 redundant copies of expert 1 will never be used.
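
Here is a toy sketch of that reading (hypothetical data structures; not the actual topk_ids_logical_to_physical implementation):

# 8 physical slots holding logical experts [0, 1, 2, 3, 4, 1, 1, 2].
physical_to_logical = [0, 1, 2, 3, 4, 1, 1, 2]

# Under a purely static mapping, each logical expert id resolves to one fixed
# physical slot, so the replicas at slots 5, 6, 7 would never be selected.
static_logical_to_physical = {}
for slot, logical in enumerate(physical_to_logical):
    static_logical_to_physical.setdefault(logical, slot)

topk_ids_logical = [1, 2]
print([static_logical_to_physical[e] for e in topk_ids_logical])  # -> [1, 2]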

fzyzcjy (Collaborator, Author) commented Jul 11, 2025

IMHO, static dispatch is useful when there is a large number of ranks.
