Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE layer if epmoe is enabled #8511
Conversation
@@ -506,8 +506,11 @@ def forward_normal(
         else:
             kwargs["router_logits"] = router_logits
         final_hidden_states = self.experts(**kwargs)
-        if not _is_cuda and not _use_aiter:
+        if not _is_cuda and not _use_aiter or global_server_args_dict["enable_ep_moe"]:
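For context around the one-line condition change, here is a minimal, self-contained sketch of the control flow after this PR. Only `forward_normal`, `self.experts`, `_is_cuda`, `_use_aiter`, `global_server_args_dict["enable_ep_moe"]`, and `routed_scaling_factor` come from the diff above; the class name, the dummy experts callable, and the placeholder flag values are illustrative and not the real sglang implementation.

```python
import torch

# Placeholder stand-ins for sglang's module-level flags (illustrative values only).
_is_cuda = False
_use_aiter = False
global_server_args_dict = {"enable_ep_moe": True}


class MoEBlockSketch:
    """Minimal sketch of a DeepSeek-style MoE block's normal forward path."""

    def __init__(self, experts, routed_scaling_factor: float):
        self.experts = experts  # callable: (hidden_states, router_logits) -> tensor
        self.routed_scaling_factor = routed_scaling_factor

    def forward_normal(self, hidden_states: torch.Tensor, router_logits: torch.Tensor) -> torch.Tensor:
        final_hidden_states = self.experts(hidden_states, router_logits)
        # Fused CUDA backends already fold the routed scaling factor into their
        # kernels; EPMoE does not, so the model layer applies it explicitly here.
        if (not _is_cuda and not _use_aiter) or global_server_args_dict["enable_ep_moe"]:
            final_hidden_states = final_hidden_states * self.routed_scaling_factor
        return final_hidden_states


# Usage: a dummy experts callable keeps the sketch runnable.
block = MoEBlockSketch(experts=lambda h, r: h, routed_scaling_factor=2.5)
out = block.forward_normal(torch.randn(2, 16), torch.randn(2, 8))
```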
Maybe we can have a property in `self.experts` indicating whether to apply the routed scaling factor in the model? Or can we fuse it for EpMoE too?
These are the scenarios I know about for CUDA:
- FusedMoE triton backend: the multiply is fused into `moe_sum_reduce`, but we also need to divide it out of the shared experts in `biased_grouped_topk` to cancel it (see the numeric sketch after this list).
- FusedMoE model_opt FP4 (with or without enable_ep_moe): applied in `ModelOptNvFp4FusedMoEMethod`, but #8364 ([1/2] sgl-kernel: Fuse routed scaling factor into select_experts) will fuse the multiply into `biased_grouped_topk`. I'm worried this change will cause it to be applied twice for this path.
- EpMoE: was missing, but this PR fixes it and moves the scaling to the model forward.
- DeepEpMoE: applied in the model forward.
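A tiny numeric illustration of the cancellation described in the first scenario above. This is a sketch under assumed semantics: in the real code the multiply lives inside `moe_sum_reduce` and the division is applied to the shared experts' contribution via `biased_grouped_topk`; here both are simulated with plain tensor ops.

```python
import torch

routed_scaling_factor = 2.5
routed_out = torch.randn(4, 8)   # combined output of the routed experts
shared_out = torch.randn(4, 8)   # output of the shared experts

# Reference semantics: only the routed experts' output is scaled.
reference = routed_scaling_factor * routed_out + shared_out

# Fused path: the shared contribution is pre-divided by the factor, so the
# single multiply applied to the whole sum leaves it effectively unscaled.
fused = routed_scaling_factor * (routed_out + shared_out / routed_scaling_factor)

torch.testing.assert_close(reference, fused)
```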
Let me fix this after #8515 is merged.
Motivation
Issue: #8402
I also noticed a significant regression when running EPMoE during the recent GLM-4.5 support work: GSM8K accuracy drops from 0.965 to 0.745 when EPMoE is enabled. Accuracy is good for TP & DeepEP.
Modifications
There is a bug in DeepSeek-V2 and GLM-4.5 related to how `routed_scaling_factor` is applied in MoE (Mixture-of-Experts) layers.
Currently, `routed_scaling_factor` is applied in three different places depending on the code path (for example in the top-k selection, in the experts/kernel implementation, or in the model forward), leading to ambiguity.
This results in uncertain and inconsistent scaling, as the model layer has no visibility into whether `routed_scaling_factor` has already been applied upstream.
🎪 TL;DR: The model can't know whether it should apply `* routed_scaling_factor` or not, because the topk and the experts may or may not have already done it, depending on the code path.
My PR forces the model layer to apply `* routed_scaling_factor` when EPMoE is enabled, because in the current codebase EPMoE won't apply it by itself.
We still need to follow up and possibly refactor the sglang MoE codebase to make it clear which layer should apply `* routed_scaling_factor`; a possible shape is sketched below.
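One possible shape for that follow-up, along the lines of the property suggested in the review discussion above. This is only a sketch: every class and attribute name here (e.g. `apply_routed_scaling_factor_in_model`) is hypothetical and does not exist in sglang today.

```python
import torch


class EPMoESketch:
    """Hypothetical EPMoE-style experts: no fused scaling, so the model applies it."""

    apply_routed_scaling_factor_in_model = True

    def __call__(self, hidden_states: torch.Tensor, router_logits: torch.Tensor) -> torch.Tensor:
        return hidden_states  # stand-in for dispatch/combine across expert ranks


class FusedMoESketch:
    """Hypothetical fused backend: the factor is already folded into its kernels."""

    apply_routed_scaling_factor_in_model = False

    def __call__(self, hidden_states: torch.Tensor, router_logits: torch.Tensor) -> torch.Tensor:
        return hidden_states  # stand-in for a kernel that scales internally


def moe_forward(experts, hidden_states, router_logits, routed_scaling_factor):
    out = experts(hidden_states, router_logits)
    # One explicit decision point instead of scattered backend flag checks.
    if getattr(experts, "apply_routed_scaling_factor_in_model", True):
        out = out * routed_scaling_factor
    return out


h, r = torch.randn(2, 16), torch.randn(2, 8)
scaled = moe_forward(EPMoESketch(), h, r, routed_scaling_factor=2.5)      # scaled in model
unscaled = moe_forward(FusedMoESketch(), h, r, routed_scaling_factor=2.5)  # kernel owns scaling
```

With a property like this, the model forward would no longer need to inspect `_is_cuda`, `_use_aiter`, or `enable_ep_moe` to decide whether to scale.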
Accuracy Test
Benchmark & Profiling
Checklist