EPLB #5295

Conversation
# Conflicts:
#	python/sglang/bench_one_batch.py
#	python/sglang/srt/layers/moe/ep_moe/layer.py
#	python/sglang/srt/layers/moe/topk.py
#	python/sglang/srt/managers/schedule_batch.py
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/managers/tokenizer_manager.py
#	python/sglang/srt/managers/tp_worker.py
#	python/sglang/srt/model_executor/model_runner.py
#	python/sglang/srt/models/deepseek_v2.py
#	python/sglang/srt/server_args.py
#	python/sglang/srt/utils.py
# Conflicts:
#	python/sglang/srt/managers/schedule_batch.py
#	python/sglang/srt/model_executor/model_runner.py
#	python/sglang/srt/server_args.py
Hi @fzyzcjy, thanks for your excellent work on the EPLB mechanism; it's very inspiring! I'm currently a student interested in EPLB. Recently, I came across your feat/dev_branch and wanted to study the implementation of EPLB in depth, especially regarding how expert distribution and rebalancing are handled. However, when checking out the branch, I noticed that some components appear to be missing or incomplete. I tried to patch things myself, but couldn't fully reproduce the intended behavior. May I ask:

Thanks again for sharing your work publicly. I'd really appreciate any guidance!
EPLB is already there.
Thank you very much for your contributions to EPLB! I would like to ask you a question. Does the current main branch support "Support changing locations of experts when server is running"? If not, which branch should I switch to? I want to analyze the overhead of expert migration during the running process. @fzyzcjy
Sure, |
Thank you very much for your answer! I would also like to ask for a piece of advice: if I want to profile the overhead of expert migration during rebalancing, which files or classes do you suggest I study (on the main branch)? @fzyzcjy
start from EPLBManager and the logic is pretty easy to read |
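For the profiling question above, a generic timing wrapper like the following could be a starting point. This is an illustrative sketch, not sglang code: `eplb_manager.rebalance` in the usage comment is a hypothetical name, and the real entry points live in `EPLBManager` on the main branch.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds).

    For GPU work you would additionally synchronize the device
    (e.g. torch.cuda.synchronize()) before and after the call so the
    measured wall time includes kernel execution, not just launch time.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage sketch (hypothetical object/method names):
#   _, dt = timed_call(eplb_manager.rebalance)
#   print(f"rebalance took {dt:.3f}s")
```

Wrapping the rebalance entry point this way separates migration overhead from normal forward-pass time in the logs.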
Thanks!
Hi @fzyzcjy, I noticed that when a rebalance happens, all requests need to wait until the rebalance finishes. First, is my understanding correct? If that is the case, I'm thinking: since only redundant experts change during the rebalance, is it possible to not make requests wait for it?
@tianhaoz95 Hi, almost all (at least most) experts do change.
Hi, taking DeepSeek V3 as an example, what is the approximate rebalance delay? Do you have any test data?
@nannaer for R1 on a single host with 8 GPUs I have:
@fzyzcjy Got it. Right, in my brief testing 156 out of 160 experts get replaced for a single layer. However, I was thinking that these 160 are still a combination of 128 distinct experts and 32 redundant experts, and only those 32 redundant experts are different experts; is this correct?
@fzyzcjy also, I'm trying to understand
I am a bit confused about this...
@fzyzcjy my bad, let me try with an example. If a model layer has 5 experts and I add 3 redundant experts, we now have 8 physical experts: experts 0, 1, 2, 3, 4 plus 1, 1, 2 (where expert 1 is duplicated twice and expert 2 once). When I set
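To make the redundant-expert example above concrete, here is a small Python sketch of such a physical-to-logical expert mapping. This is illustrative only; the names are not sglang's actual API.

```python
# 5 logical experts (0..4) plus 3 redundant copies (expert 1 twice,
# expert 2 once) gives 8 physical expert slots in total.
physical_to_logical = [0, 1, 2, 3, 4, 1, 1, 2]

def physical_slots(logical_id, mapping):
    """Return all physical slots that host the given logical expert."""
    return [slot for slot, lid in enumerate(mapping) if lid == logical_id]

print(physical_slots(1, physical_to_logical))  # [1, 5, 6]
print(physical_slots(0, physical_to_logical))  # [0]
```

A rebalance then amounts to rewriting this mapping (and copying the corresponding weights), which is why most physical slots can change even though the set of logical experts stays the same.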
IMHO static dispatch is useful when there are a large number of ranks.
This PR contains all code from the PR chain (please merge that chain instead of this one), and I will put experiment results below.
What's next

- eplb_rebalance overhead: The logic is there, but there are many easy optimizations to further reduce the rebalance overhead and fix minor bugs (e.g. try to make the load-from-pin-memory mode share pinned memory across processes). Given that the EPLB feature is needed urgently, I will lower the priority of this optimization and do it later.
- eplb_rebalance: Given the priority, I will do that later.
- moe_fused_gate kernel and it can be fused.

PR Chain
If you want to try PD + EPLB + two-batch-overlap + ..., here is the branch that merges everything before they are merged into master: https://github.com/fzyzcjy/sglang/tree/feat/dev_branch
These can be reviewed:
These are outdated and I will update later:
A lot of code currently only exists in this branch, but should be extracted to separate PRs later, such as:

Low priority future work
2025.04.11 Quick Experiment 1: 2x8xH100
Note: I have not done ANY profiling or other code tuning; these results are dumped directly from the most naive code.
Explanations
Outputs
baseline (the even ones are sharegpt, the odd ones are random; please ignore the first several runs, which are kind of warmup)
PR
Conclusion
2025.04.11 Quick Experiment 2: 4x8xH100
Output
baseline output
PR output
Conclusion
2025.04.12 Quick Experiment 3: 4x8xH100
Output
baseline output
PR output
Conclusion
Update
Update 2025.04.18
To ensure the latest code still maintains correctness and performance, here is an experiment with the latest code:
Output
Update 2025.04.18: EPLB + PD correctness test
Use the https://github.com/fzyzcjy/sglang/tree/feat/dev_branch branch (which contains the PRs about PD + EPLB)
Got 94.5 on gsm8k.