Conversation

fzyzcjy (Collaborator) commented Apr 11, 2025

This PR contains all code from the PR chain (please merge that chain instead of this one), and I will put experiment results below.

What's next

  • Further reduce eplb_rebalance overhead: The logic is there, but there are many easy optimizations to further reduce the rebalance overhead and fix minor bugs (e.g. making the load-from-pin-memory mode share pinned memory across processes). Given that the EPLB feature is needed urgently, I will lower the priority of this optimization and do it later.
  • Further test eplb_rebalance: Given the priority, I will do that later.
  • There is a new moe_fused_gate kernel and it can be fused.

PR Chain

If you want to try PD + EPLB + two-batch-overlap + ..., here is the branch that merges everything before they are merged into master: https://github.com/fzyzcjy/sglang/tree/feat/dev_branch

These can be reviewed:

These are outdated and I will update later:

Low priority future work

  • Support overlap scheduler for dynamic mode (static mode already supports it): Since speculative decoding and PD disaggregation require disabling the overlap scheduler, and EPLB works best with them, making dynamic mode support the overlap scheduler does not seem to be a high-priority task.
  • Only enable the expert distribution recorder for a fraction of requests instead of all of them in online EPLB mode: Given that we want to rebalance very frequently, this does not seem necessary.
  • Maybe determine how to dispatch based on the gate output computed for each batch.

2025.04.11 Quick Experiment 1: 2x8xH100

Note: I have not done ANY profiling or other code tuning; these results are dumped directly from the most naive code.

# prepare: collect stat data using sharegpt
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 16 --dp 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --disable-overlap-schedule --decode-log-interval 1 --host 0.0.0.0 --port 20000 --enable-eplb --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.6:15000 --nnodes 2 --node-rank 0
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1000 --sharegpt-output-len 1 --max-concurrency 64 --port 20000
curl -X POST http://127.0.0.1:20000/eplb_save_expert_distribution
cp /tmp/eplb_storage/expert_distribution_storage/YOUR_NAME.json /host_home/temp_sglang_server2local/

# baseline
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 16 --dp 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --disable-overlap-schedule --decode-log-interval 1 --host 0.0.0.0 --port 20000 --dist-init-addr 10.10.38.8:15000 --nnodes 2 --node-rank 0

# PR
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 16 --dp 16 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --init-expert-location /host_home/temp_sglang_server2local/YOUR_NAME.json --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.6:15000 --nnodes 2 --node-rank 0

# tests
(cd /host_home/primary_synced/sglang && while true; do python3 benchmark/gsm8k/bench_sglang.py --port 20000 --parallel 1400 --num-questions 1400; done)
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1000 --sharegpt-output-len 4 --max-concurrency 64 --port 20000 ; python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1 --random-range-ratio 1.0 --max-concurrency 64 --port 20000 ; done

Explanations

  • Here I use the most manual workflow: one run to collect stats and another run to use them (see the sketch after this list). Directly calling rebalance on the fly is also implemented and has passed the tests.
  • Since I collect stats from the sharegpt dataset, I test on both sharegpt and random, to avoid overfitting to sharegpt's scenario.
  • Here I only test prefill, not decode (because I use dp attention, deepep-mode=auto is unavailable, so I need deepep-mode=normal, which cannot use CUDA graph, so decode is super slow).
    • The sharegpt dataset requires at least 4 new tokens, so it contains some decode. The random dataset is prefill-only.
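
For convenience, here is a minimal Python sketch of the save-and-copy step from the prepare block above; it only automates the curl and cp commands already shown, and assumes the dump lands under the expert_distribution_storage directory with an arbitrary filename:

import glob
import shutil

import requests

SERVER = "http://127.0.0.1:20000"

# Ask the running server to dump the recorded expert distribution
# (same endpoint as the curl call in the prepare block).
requests.post(f"{SERVER}/eplb_save_expert_distribution").raise_for_status()

# Copy the newest dump somewhere persistent so a later launch can consume it
# via --init-expert-location.
dumps = sorted(glob.glob("/tmp/eplb_storage/expert_distribution_storage/*.json"))
shutil.copy(dumps[-1], "/host_home/temp_sglang_server2local/")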

Outputs

baseline (the even ones are sharegpt, the odd ones are random; please ignore the first several runs, which are warmup)

Accuracy: 0.933
Accuracy: 0.930
Accuracy: 0.932
Accuracy: 0.934
Accuracy: 0.934
Accuracy: 0.930
Accuracy: 0.934
Accuracy: 0.934
Accuracy: 0.933
Accuracy: 0.934

Total token throughput (tok/s):          4201.48   
Total token throughput (tok/s):          16533.91  
Total token throughput (tok/s):          6899.32   
Total token throughput (tok/s):          17985.68  
Total token throughput (tok/s):          6898.84   
Total token throughput (tok/s):          17982.58  
Total token throughput (tok/s):          6919.91   
Total token throughput (tok/s):          18081.89  
Total token throughput (tok/s):          5069.11   
Total token throughput (tok/s):          18024.01  
Total token throughput (tok/s):          6899.14   
Total token throughput (tok/s):          17867.05  
Total token throughput (tok/s):          6116.35   
Total token throughput (tok/s):          18053.47  
Total token throughput (tok/s):          6926.80   
Total token throughput (tok/s):          18023.24  
Total token throughput (tok/s):          6866.18   
Total token throughput (tok/s):          18059.59  
Total token throughput (tok/s):          6925.82   
Total token throughput (tok/s):          18077.90  
Total token throughput (tok/s):          7089.84   
Total token throughput (tok/s):          18086.66  
Total token throughput (tok/s):          6943.99   
Total token throughput (tok/s):          18029.46  
Total token throughput (tok/s):          6831.31

PR

Accuracy: 0.927
Accuracy: 0.937
Accuracy: 0.936
Accuracy: 0.932
Accuracy: 0.936
Accuracy: 0.933
Accuracy: 0.936

Total token throughput (tok/s):          6505.96   
Total token throughput (tok/s):          19834.88  
Total token throughput (tok/s):          4747.47   
Total token throughput (tok/s):          19671.99  
Total token throughput (tok/s):          4301.55   
Total token throughput (tok/s):          19692.20  
Total token throughput (tok/s):          6472.16   
Total token throughput (tok/s):          19743.38  
Total token throughput (tok/s):          6438.49   
Total token throughput (tok/s):          19772.33  
Total token throughput (tok/s):          6337.45   
Total token throughput (tok/s):          18282.91  
Total token throughput (tok/s):          6340.48   
Total token throughput (tok/s):          19469.87  
Total token throughput (tok/s):          6683.97   
Total token throughput (tok/s):          19692.82  
Total token throughput (tok/s):          6501.55   
Total token throughput (tok/s):          19730.55  
Total token throughput (tok/s):          6512.66   
Total token throughput (tok/s):          19683.61  
Total token throughput (tok/s):          6550.42   
Total token throughput (tok/s):          19725.90  
Total token throughput (tok/s):          6507.50   
Total token throughput (tok/s):          19810.17  
Total token throughput (tok/s):          6531.86   
Total token throughput (tok/s):          19834.79  
Total token throughput (tok/s):          6488.62   
Total token throughput (tok/s):          19296.52  
Total token throughput (tok/s):          6128.42   

Conclusion

  • Collect stats from sharegpt, then test on random: 18000 -> 19500, so there is a speedup.
  • Collect stats from sharegpt, then test on sharegpt: 7000 -> 6500, which looks like a slowdown; but since this run contains decode and, as mentioned above, decode is out of scope here, I am not sure whether it is a real slowdown. I will do more decode experiments later.
  • Correctness on gsm8k: seems roughly equal.
  • EDIT: I realized the baseline wrongly used --disable-overlap-schedule, so the numbers are off; fixed in experiment 3.

2025.04.11 Quick Experiment 2: 4x8xH100

# baseline
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --disable-overlap-schedule --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 32768 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

# PR
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --chunked-prefill-size 32768 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

# bench
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1 --random-range-ratio 1.0 --max-concurrency 1000 --port 20000 ; done

Output

baseline output

Total token throughput (tok/s):          26736.89  
Total token throughput (tok/s):          32337.00  
Total token throughput (tok/s):          32368.31  
Total token throughput (tok/s):          32034.95  
Total token throughput (tok/s):          31059.59  
Total token throughput (tok/s):          32322.06  
Total token throughput (tok/s):          32821.19  
Total token throughput (tok/s):          32908.73  
Total token throughput (tok/s):          32756.34  
Total token throughput (tok/s):          32906.58  
Total token throughput (tok/s):          33375.62  
Total token throughput (tok/s):          33160.48  
Total token throughput (tok/s):          33233.37  
Total token throughput (tok/s):          33103.22  
Total token throughput (tok/s):          33150.36  
Total token throughput (tok/s):          33320.48  
Total token throughput (tok/s):          33138.29  
Total token throughput (tok/s):          33161.51  

PR output

Total token throughput (tok/s):          28799.74  
Total token throughput (tok/s):          36454.41  
Total token throughput (tok/s):          31820.63  
Total token throughput (tok/s):          34819.66  
Total token throughput (tok/s):          36648.10  
Total token throughput (tok/s):          36529.53  
Total token throughput (tok/s):          36300.26  
Total token throughput (tok/s):          37110.16  
Total token throughput (tok/s):          36968.22  
Total token throughput (tok/s):          36791.78  
Total token throughput (tok/s):          38198.47  
Total token throughput (tok/s):          37466.60  
Total token throughput (tok/s):          38136.77  
Total token throughput (tok/s):          38384.78  
Total token throughput (tok/s):          38322.69  
Total token throughput (tok/s):          37399.52  
Total token throughput (tok/s):          38184.65  
Total token throughput (tok/s):          37580.02  

Conclusion

  • Roughly 33000 -> 37500
  • EDIT: I realized the baseline wrongly used --disable-overlap-schedule, so the numbers are off; fixed in experiment 3.

2025.04.12 Quick Experiment 3: 4x8xH100

# baseline
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --dist-init-addr 10.10.37.16:15000 --nnodes 4 --node-rank 0

# PR
SGLANG_LOG_EXPERT_LOCATION_METADATA=1 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 262144 --max-total-tokens 131076 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

# bench
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 1024 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done

Output

baseline output

Total token throughput (tok/s):          27674.34  
Total token throughput (tok/s):          28727.64  
Total token throughput (tok/s):          30776.73  
Total token throughput (tok/s):          31028.49  
Total token throughput (tok/s):          20955.54  
Total token throughput (tok/s):          25092.32  
Total token throughput (tok/s):          30976.76  
Total token throughput (tok/s):          30892.42  
Total token throughput (tok/s):          30206.75  
Total token throughput (tok/s):          31271.44  
Total token throughput (tok/s):          31619.85  
Total token throughput (tok/s):          30824.69  
Total token throughput (tok/s):          30973.65  
Total token throughput (tok/s):          31600.21  
Total token throughput (tok/s):          30503.18  
Total token throughput (tok/s):          31693.68  
Total token throughput (tok/s):          24758.57  
Total token throughput (tok/s):          31781.42  
Total token throughput (tok/s):          31557.28  
Total token throughput (tok/s):          31542.53  

PR output

Total token throughput (tok/s):          31486.29  
Total token throughput (tok/s):          15746.00  
Total token throughput (tok/s):          20702.08  
Total token throughput (tok/s):          35706.74  
Total token throughput (tok/s):          34202.98  
Total token throughput (tok/s):          34990.21  
Total token throughput (tok/s):          35077.61  
Total token throughput (tok/s):          35654.64  
Total token throughput (tok/s):          35710.16  
Total token throughput (tok/s):          26289.05  
Total token throughput (tok/s):          37157.12  
Total token throughput (tok/s):          37312.28  
Total token throughput (tok/s):          37287.00  
Total token throughput (tok/s):          35910.75  
Total token throughput (tok/s):          35809.48  
Total token throughput (tok/s):          37008.45  
Total token throughput (tok/s):          35527.77  
Total token throughput (tok/s):          27380.26  
Total token throughput (tok/s):          37558.58  
Total token throughput (tok/s):          37124.57  
Total token throughput (tok/s):          37017.02  

Conclusion

  • Throughput 31500 -> 37000

Update

  • Prefill + sharegpt dataset: also observed a speedup, 30500 -> 34500.
  • Decode + random dataset (because of low-latency DeepGEMM restrictions, I have to make the prompt length very short): also observed the utilization rate change from 54.6% to 78.1%, which agrees with eplb_simulator.py's prediction of 54.7% -> 79.7%. As for speedup: since GEMM takes a tiny portion of the current profile, an end-to-end speedup cannot be observed yet, until the other parts are faster.
  • The speedup depends on the expert distribution and on distribution shift, so you can use eplb_simulator.py to simulate the theoretical speedup on your data and check whether the actual speedup agrees with it (a minimal sketch of the utilization metric follows this list).
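
As a rough illustration of the utilization-rate numbers above, here is a minimal sketch that assumes "utilization rate" means mean per-GPU expert load divided by the maximum per-GPU load; this is not the actual eplb_simulator.py code, just the intuition behind it:

from typing import List

def utilization_rate(tokens_per_physical_expert: List[int], num_gpus: int) -> float:
    """Estimate how evenly MoE work is spread across GPUs for one layer (assumed metric)."""
    experts_per_gpu = len(tokens_per_physical_expert) // num_gpus
    loads = [
        sum(tokens_per_physical_expert[g * experts_per_gpu:(g + 1) * experts_per_gpu])
        for g in range(num_gpus)
    ]
    # The most loaded GPU bounds the step time; perfect balance gives 1.0.
    return sum(loads) / len(loads) / max(loads)

# Example: 8 GPUs, 16 physical experts, one very hot expert on GPU 0.
print(utilization_rate([900, 100] + [100] * 14, num_gpus=8))  # ~0.3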

Update 2025.04.18

To confirm that the latest code still has correct results and good performance, here is an experiment with the latest code:

# baseline
python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 131076 --max-total-tokens 65536 --dist-init-addr 10.10.38.4:15000 --nnodes 4 --node-rank 0

# PR
python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --tp 32 --dp 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 20000 --moe-dense-tp-size 1 --chunked-prefill-size 131076 --max-total-tokens 65536 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32 --dist-init-addr 10.10.38.8:15000 --nnodes 4 --node-rank 0

# correctness
(cd /host_home/primary_synced/sglang && while true; do python3 benchmark/gsm8k/bench_sglang.py --port 20000 --parallel 1400 --num-questions 1400; done)

# bench
while true; do python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 2048 --random-input 1024 --random-output 1 --random-range-ratio 1.0 --max-concurrency 2048 --port 20000 ; done

Output

  • GSM8k: 93.2 vs 93.6
  • Speed: 30500 vs 35000

Update 2025.04.18: EPLB + PD correctness test

Use https://github.com/fzyzcjy/sglang/tree/feat/dev_branch branch (which contains PRs about PD + EPLB)

MOONCAKE_CONFIG_PATH=./collab_pd_node8.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode prefill --host 0.0.0.0 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.8:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32

MOONCAKE_CONFIG_PATH=./collab_pd_node9.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode prefill --host 0.0.0.0 --port 30000 --trust-remote-code --dist-init-addr 10.10.38.8:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.8 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32

MOONCAKE_CONFIG_PATH=./collab_pd_node10.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode decode --host 0.0.0.0 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 0 --tp-size 16 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32

MOONCAKE_CONFIG_PATH=./collab_pd_node11.json python -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --disaggregation-mode decode --host 0.0.0.0 --port 30001 --trust-remote-code --dist-init-addr 10.10.38.10:5000 --nnodes 2 --node-rank 1 --tp-size 16 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.8 --cuda-graph-max-bs 128 --max-running-requests 128 --init-expert-location /host_home/temp_sglang_server2local/1744367695354916448.json --ep-num-redundant-experts 32

python3 -m sglang.srt.disaggregation.mini_lb --prefill http://10.10.38.8:30000 --decode http://10.10.38.10:30001 --host 0.0.0.0 --port 7000

(cd /host_home/primary_synced/sglang && while true; do python3 benchmark/gsm8k/bench_sglang.py --port 7000 --parallel 1400 --num-questions 1400; done)

Got a GSM8K score of 94.5.

fzyzcjy added 10 commits April 18, 2025 10:32
hihiztc1 commented May 12, 2025

Hi @fzyzcjy,

Thanks for your excellent work on the EPLB mechanism — it’s very inspiring!

I’m currently a student interested in EPLB. Recently, I came across your feat/dev_branch and wanted to study the implementation of EPLB in depth, especially regarding how expert distribution and rebalancing are handled.

However, when checking out the branch, I noticed that some components appear to be missing or incomplete. I tried to patch things myself, but couldn't fully reproduce the intended behavior.

May I ask:
Which commit ID or tag contains a complete and working version of the EPLB implementation?

Thanks again for sharing your work publicly. I’d really appreciate any guidance!
[Screenshot attached: EPLB_error]

fzyzcjy (Collaborator, Author) commented May 12, 2025

@hihiztc1 Hi, please check #6017, which contains full guidance.

fzyzcjy (Collaborator, Author) commented May 24, 2025

EPLB is already there.

@fzyzcjy fzyzcjy closed this May 24, 2025
nannaer commented Jun 23, 2025

Thank you very much for your contributions to EPLB! I would like to ask a question: does the current main branch support "changing locations of experts when the server is running"? If not, which branch should I switch to? I want to analyze the overhead of expert migration while the server is running. @fzyzcjy

fzyzcjy (Collaborator, Author) commented Jun 23, 2025

Support changing locations of experts when server is running

Sure, --enable-eplb

nannaer commented Jun 23, 2025

Support changing locations of experts when server is running

Sure, --enable-eplb

Thank you very much for your answer! I would also like to ask for a piece of advice: if I want to profile the overhead of expert migration during rebalancing, which files or classes do you suggest I study? (main branch) @fzyzcjy

fzyzcjy (Collaborator, Author) commented Jun 23, 2025

Start from EPLBManager; the logic is pretty easy to read.
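
For intuition only, here is a tiny sketch of the kind of computation a rebalance performs (a hypothetical helper, not the actual EPLBManager code): given the recorded per-logical-expert token counts, assign the redundant physical slots to the hottest experts so per-replica load evens out; the expensive part in practice is then physically moving the expert weights to match the new placement.

from typing import Dict, List

def greedy_replication(tokens_per_logical_expert: List[int], num_redundant: int) -> Dict[int, int]:
    """Return {logical_expert_id: replica_count}; total slots = num experts + num_redundant."""
    replicas = {e: 1 for e in range(len(tokens_per_logical_expert))}
    for _ in range(num_redundant):
        # Give the next redundant slot to the expert with the highest per-replica load.
        hottest = max(replicas, key=lambda e: tokens_per_logical_expert[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

# Example: 8 logical experts, 2 redundant slots; expert 0 is very hot.
print(greedy_replication([800, 100, 90, 80, 70, 60, 50, 40], num_redundant=2))  # expert 0 gets 3 replicas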

nannaer commented Jun 24, 2025

start from EPLBManager and the logic is pretty easy to read

Thanks!

tianhaoz95 commented Jul 8, 2025

Hi @fzyzcjy, I noticed that when a rebalance happens, all requests need to wait until the rebalance finishes. First, is my understanding correct? If so, I am wondering: since only redundant experts change during the rebalance, is it possible not to make requests wait for it?

fzyzcjy (Collaborator, Author) commented Jul 8, 2025

@tianhaoz95 Hi

since only redundant experts change during the rebalance

almost all (at least most) experts change indeed

nannaer commented Jul 8, 2025

@tianhaoz95 Hi

since only redundant experts change during the rebalance

almost all (at least most) experts change indeed

Hi, taking DeepSeek V3 as an example, what is the approximate rebalance delay? Do you have any test data?

@tianhaoz95

@nannaer For R1 on a single host with 8 GPUs I have:

--- Expert Placement Latency Measurement ---
Latency: 0.6830 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6732 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.5920 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6737 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.5941 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6907 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6790 seconds for updating 58 layers.

--- Expert Placement Latency Measurement ---
Latency: 0.6887 seconds for updating 58 layers.

tianhaoz95 commented Jul 9, 2025

@tianhaoz95 Hi

since only redundant experts change during the rebalance

almost all (at least most) experts change indeed

@fzyzcjy Got it. Right, in my brief testing 156 out of 160 experts get replaced for a single layer. However, I was thinking that this 160 is still a combination of 128 distinct experts and 32 redundant experts, and only these 32 redundant experts are different. Is this correct?

tianhaoz95 commented Jul 9, 2025

@fzyzcjy Also, I'm trying to understand the --ep-dispatch-algorithm static option. It almost looks like it will ignore all the redundant experts and only map the gating logits to non-redundant experts; am I missing something?

fzyzcjy (Collaborator, Author) commented Jul 10, 2025

it almost looks like this will ignore all the redundant experts

I am a bit confused about this...

tianhaoz95 commented Jul 10, 2025

it almost looks like this will ignore all the redundant experts

I am a bit confused about this...

@fzyzcjy my bad, let me try with an example.

For example, suppose a model layer has 5 experts and I add 3 redundant experts.

Let's assume we now have 8 physical experts holding logical experts 0, 1, 2, 3, 4 plus 1, 1, 2 (expert 1 is duplicated twice and expert 2 once).

When I set --ep-dispatch-algorithm static and my gating logits map to experts [1, 2] (assuming topk is 2), it looks like topk_ids_logical_to_physical will always output [1, 2], and the 2 redundant copies of expert 1 will never be used.
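
Here is a toy sketch of that reading (hypothetical data structures; not the actual topk_ids_logical_to_physical implementation):

# 8 physical slots holding logical experts [0, 1, 2, 3, 4, 1, 1, 2].
physical_to_logical = [0, 1, 2, 3, 4, 1, 1, 2]

# Under a purely static mapping, each logical expert id resolves to one fixed
# physical slot, so the replicas at slots 5, 6, 7 would never be selected.
static_logical_to_physical = {}
for slot, logical in enumerate(physical_to_logical):
    static_logical_to_physical.setdefault(logical, slot)

topk_ids_logical = [1, 2]
print([static_logical_to_physical[e] for e in topk_ids_logical])  # -> [1, 2]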

fzyzcjy (Collaborator, Author) commented Jul 11, 2025

IMHO, static dispatch is useful when there is a large number of ranks.
