
Conversation


@byjiang1996 byjiang1996 commented Jul 29, 2025

…er if epmoe is enabled

Motivation

Issue: #8402

Also noticed a significant regression when running EPMoE during the recent GLM-4.5 support work: GSM8K accuracy drops from 0.965 to 0.745 when EPMoE is enabled. Accuracy is good for TP & DeepEP.

python3 -m sglang.launch_server --model /shared/public/elr-models/zai-org/GLM-4.5 --tp-size 8 --trust-remote-code

python3 benchmark/gsm8k/bench_sglang.py

Modifications

There is a bug in DeepSeek-V2 and GLM-4.5 related to how routed_scaling_factor is applied in MoE (Mixture-of-Experts) layers.

Currently, the routed_scaling_factor is applied in three different places, leading to ambiguity:

  • self.topk.forward: the scaling is applied only in some code paths (the n out of m conditions), but not in the remaining m - n conditions.
  • self.experts.forward: same as above; the factor is applied in only some of the code paths.
  • Model-level logic: the model itself may apply routed_scaling_factor, but it cannot know whether self.topk or self.experts has already done so.

This results in uncertain and inconsistent scaling, as the model layer has no visibility into whether routed_scaling_factor has already been applied upstream.

🎪 TL;DR: The model can't know whether it should apply * routed_scaling_factor, because topk and experts may or may not have already done it, depending on the code path.
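
To make the ambiguity concrete, here is a minimal, self-contained sketch (the function names, the k=2 top-k, and the scale_in_topk/scale_in_model flags are illustrative, not the actual sglang API). The output is correct only when the factor is applied exactly once, somewhere:

import torch

routed_scaling_factor = 2.5  # illustrative; the real value comes from the model config

def topk_forward(router_logits, scale_in_topk: bool):
    # Pick top-2 experts; some code paths scale the weights here, others don't.
    weights = torch.softmax(router_logits, dim=-1)
    topk_w, topk_idx = torch.topk(weights, k=2, dim=-1)
    if scale_in_topk:
        topk_w = topk_w * routed_scaling_factor
    return topk_w, topk_idx

def model_forward(router_logits, expert_out, scale_in_topk: bool, scale_in_model: bool):
    topk_w, _ = topk_forward(router_logits, scale_in_topk)
    out = expert_out * topk_w.sum(dim=-1, keepdim=True)
    if scale_in_model:
        out = out * routed_scaling_factor
    return out

logits, h = torch.randn(1, 8), torch.ones(1, 16)
a = model_forward(logits, h, scale_in_topk=True,  scale_in_model=False)  # scaled once
b = model_forward(logits, h, scale_in_topk=False, scale_in_model=True)   # scaled once
c = model_forward(logits, h, scale_in_topk=True,  scale_in_model=True)   # scaled twice
d = model_forward(logits, h, scale_in_topk=False, scale_in_model=False)  # never scaled
print(torch.allclose(a, b), torch.allclose(a, c), torch.allclose(a, d))  # True False False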

My PR forces the model layer to apply * routed_scaling_factor when EPMoE is enabled, because in the current codebase EPMoE does not apply * routed_scaling_factor by itself.

Follow-up needed: possibly refactor the sglang MoE codebase to make it clear which layer should apply * routed_scaling_factor. A sketch of the new guard follows.
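
For reference, a self-contained sketch of the guard this PR adds (the stub values below stand in for the real sglang globals; only the condition itself appears in the diff further down). Note that Python's operator precedence parses the condition as (not _is_cuda and not _use_aiter) or enable_ep_moe, so EPMoE always takes this branch:

_is_cuda = True                  # stand-in for the real platform check
_use_aiter = False               # stand-in for the aiter backend flag
global_server_args_dict = {"enable_ep_moe": True}

routed_scaling_factor = 2.5      # illustrative config value
final_hidden_states = 1.0        # stand-in for the output tensor

# With EPMoE enabled, the model layer now applies the factor itself,
# because EPMoE does not apply it internally.
if not _is_cuda and not _use_aiter or global_server_args_dict["enable_ep_moe"]:
    final_hidden_states *= routed_scaling_factor

print(final_hidden_states)       # 2.5: scaled exactly once, at the model layer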

Accuracy Test

  • GLM-4.5 GSM8K accuracy jumps from 0.745 to 0.965 under EPMoE, matching the TP & DeepEP results.
  • Still need to run GSM8K for DeepSeek-V3 (WIP).

Benchmark & Profiling

Checklist


@zhyncs zhyncs requested review from trevor-m, kushanam and ch-wan July 29, 2025 07:01
@@ -506,8 +506,11 @@ def forward_normal(
         else:
             kwargs["router_logits"] = router_logits
         final_hidden_states = self.experts(**kwargs)
-        if not _is_cuda and not _use_aiter:
+        if not _is_cuda and not _use_aiter or global_server_args_dict["enable_ep_moe"]:
@trevor-m trevor-m Jul 29, 2025


Maybe we can have a property in self.experts indicating whether the routed scaling factor should be applied in the model (a hypothetical sketch of this follows the list below)?
Or can we fuse it for EpMoE too?

These are the scenarios I know about for cuda:

  • FusedMoE triton backend: the multiply is fused into moe_sum_reduce, but the factor also needs to be divided out from the shared experts in biased_grouped_topk to cancel it out.
  • FusedMoE model_opt FP4 (with or without enable_ep_moe): applied in ModelOptNvFp4FusedMoEMethod, but "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (#8364) will fuse the multiply into biased_grouped_topk. I'm worried this change will cause it to be applied twice on that path.
  • EpMoE: was missing, but this PR fixes it by moving the multiply into the model forward.
  • DeepEpMoE: applied in the model forward.
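
A hypothetical sketch of the property idea above (the property name and class stubs are illustrative, not the actual sglang API); the model forward could then branch on the property instead of guessing. The True/False values follow the per-backend scenarios in the list:

import torch

class FusedMoE(torch.nn.Module):
    @property
    def applies_routed_scaling_factor(self) -> bool:
        # Triton backend: the multiply is fused into moe_sum_reduce.
        return True

class EPMoE(torch.nn.Module):
    @property
    def applies_routed_scaling_factor(self) -> bool:
        # EPMoE leaves the multiply to the model layer.
        return False

def apply_scaling_if_needed(hidden, experts, routed_scaling_factor):
    # The model layer no longer guesses: it asks the experts module.
    if not experts.applies_routed_scaling_factor:
        hidden = hidden * routed_scaling_factor
    return hidden

h = apply_scaling_if_needed(torch.ones(2, 4), EPMoE(), routed_scaling_factor=2.5)
print(h[0, 0].item())  # 2.5: scaled by the model layer because EPMoE did not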

A collaborator replied:

Let me fix this after 8515 is merged.

@ch-wan ch-wan merged commit c8d3a40 into sgl-project:main Aug 1, 2025
TianQiLin666666 pushed a commit to TianQiLin666666/sglang that referenced this pull request Aug 1, 2025 (sgl-project#8511)
lifuhuang pushed a commit that referenced this pull request Aug 3, 2025 (#8511)
ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025 (#8511)
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025 (sgl-project#8511)
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025 (sgl-project#8511)

All referenced commits are co-authored by Cheng Wan <54331508+ch-wan@users.noreply.github.com>.