One branch that contains EPLB + Two Batch Overlap + dependencies #5524
Conversation
Force-pushed from 51633b5 to b69c117.
I tested this PR with DeepEP + EPLB and found that each rank only tracks the expert load on its local GPU, with no cross-rank communication or summation happening at all. The saved expert distribution JSON file shows logical counts for a layer like:

This suggests the load-balancing logic is not properly aggregating expert usage across all ranks. Could you please clarify or fix this behavior?
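A minimal sketch of the aggregation I would expect, assuming torch.distributed and a per-rank [num_layers, num_logical_experts] count tensor (the names here are illustrative, not the PR's actual API):

```python
import torch
import torch.distributed as dist


def aggregate_expert_load(local_counts: torch.Tensor, group=None) -> torch.Tensor:
    """Sum per-expert token counts over all ranks so the dumped JSON
    reflects cluster-wide logical counts rather than one GPU's view."""
    global_counts = local_counts.clone()
    # Every rank contributes its local counts; after the all-reduce each
    # rank holds the summed, cross-rank distribution.
    dist.all_reduce(global_counts, op=dist.ReduceOp.SUM, group=group)
    return global_counts
```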
May I ask what the command is to reproduce Case 1 with 3P+9D?
I found an issue both in

self.expert_distribution_communicator = _Communicator(
    self.send_to_scheduler, server_args.dp_size
)

Could you please double-check if
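For reference, this is the general pattern I understand such a communicator to follow (an illustrative sketch only, not sglang's actual _Communicator; the question is whether server_args.dp_size is the right expected-reply count here):

```python
import asyncio
from typing import Any, Callable, List


class SimpleCommunicator:
    """Fan one request out and wait for a fixed number of replies."""

    def __init__(self, send_fn: Callable[[Any], None], num_expected_replies: int):
        self._send = send_fn
        self._num_expected = num_expected_replies  # e.g. dp_size
        self._replies: List[Any] = []
        self._done = asyncio.Event()

    async def __call__(self, request: Any) -> List[Any]:
        self._replies.clear()
        self._done.clear()
        self._send(request)
        # Blocks until handle_reply() has fired `num_expected_replies` times;
        # a wrong expected count would make this hang or return too early.
        await self._done.wait()
        return list(self._replies)

    def handle_reply(self, reply: Any) -> None:
        self._replies.append(reply)
        if len(self._replies) >= self._num_expected:
            self._done.set()
```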
Hi, could you please discuss this in the issue instead? This PR contains many commits, which will cause comments to be hidden.
Closing this since everything is merged to master.
Description
This branch merges various other branches and PRs, including mine, @ch-wan's, and others'. It is not meant to be merged itself (please merge the individual PRs instead); rather, it exists so that people can try out these features together. Indeed, it works well and fast in my tests.
Below (folded) are some pretty early experiments:
Experiment 1: PD + EPLB + TBO (two batch overlap)
GSM8K, repeated runs:
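An illustrative sketch of how such a repeated run could be scripted against an already-launched server; the eval entry point and flags below are assumptions, so substitute whatever GSM8K command you normally use:

```python
import subprocess

EVAL_CMD = [
    "python3", "-m", "sglang.test.few_shot_gsm8k",  # assumed module path
    "--num-questions", "200",                        # assumed flag
]

# Re-run the same eval several times to check run-to-run stability.
for run_idx in range(5):
    print(f"=== GSM8K run {run_idx} ===")
    subprocess.run(EVAL_CMD, check=True)
```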
Experiment 2: baseline vs baseline+EPLB vs baseline+EPLB+TBO
Remarks
2025.04.25 Update
I forgot to paste the latest results, which were obtained earlier, so here are some. You can reproduce them using this branch of code.
Case 1: Direct decode
Case 2: Simulated MTP decode
Case 3: Prefill