Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE layer if epmoe is enabled #8511
Conversation
@@ -506,8 +506,11 @@ def forward_normal(
         else:
             kwargs["router_logits"] = router_logits
         final_hidden_states = self.experts(**kwargs)
-        if not _is_cuda and not _use_aiter:
+        if not _is_cuda and not _use_aiter or global_server_args_dict["enable_ep_moe"]:
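For context around the one-line condition change, here is a minimal, self-contained sketch of the control flow after this PR. Only `forward_normal`, `self.experts`, `_is_cuda`, `_use_aiter`, `global_server_args_dict["enable_ep_moe"]`, and `routed_scaling_factor` come from the diff above; the class name, the dummy experts callable, and the placeholder flag values are illustrative and not the real sglang implementation.

```python
import torch

# Placeholder stand-ins for sglang's module-level flags (illustrative values only).
_is_cuda = False
_use_aiter = False
global_server_args_dict = {"enable_ep_moe": True}


class MoEBlockSketch:
    """Minimal sketch of a DeepSeek-style MoE block's normal forward path."""

    def __init__(self, experts, routed_scaling_factor: float):
        self.experts = experts  # callable: (hidden_states, router_logits) -> tensor
        self.routed_scaling_factor = routed_scaling_factor

    def forward_normal(self, hidden_states: torch.Tensor, router_logits: torch.Tensor) -> torch.Tensor:
        final_hidden_states = self.experts(hidden_states, router_logits)
        # Fused CUDA backends already fold the routed scaling factor into their
        # kernels; EPMoE does not, so the model layer applies it explicitly here.
        if (not _is_cuda and not _use_aiter) or global_server_args_dict["enable_ep_moe"]:
            final_hidden_states = final_hidden_states * self.routed_scaling_factor
        return final_hidden_states


# Usage: a dummy experts callable keeps the sketch runnable.
block = MoEBlockSketch(experts=lambda h, r: h, routed_scaling_factor=2.5)
out = block.forward_normal(torch.randn(2, 16), torch.randn(2, 8))
```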
Maybe we can have a property in `self.experts` indicating whether to apply the routed scaling factor in the model? Or can we fuse it for EpMoE too?
These are the scenarios I know about for CUDA:
- FusedMoE triton backend: the multiply is fused into `moe_sum_reduce`, but we also need to divide it out of the shared experts in `biased_grouped_topk` to cancel it (see the numeric sketch after this list).
- FusedMoE model_opt FP4 (with or without enable_ep_moe): applied in `ModelOptNvFp4FusedMoEMethod`, but #8364 ([1/2] sgl-kernel: Fuse routed scaling factor into select_experts) will fuse the multiply into `biased_grouped_topk`. I'm worried this change will cause it to be applied twice for this path.
- EpMoE: was missing, but this PR fixes it and moves the scaling to the model forward.
- DeepEpMoE: applied in the model forward.
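A tiny numeric illustration of the cancellation described in the first scenario above. This is a sketch under assumed semantics: in the real code the multiply lives inside `moe_sum_reduce` and the division is applied to the shared experts' contribution via `biased_grouped_topk`; here both are simulated with plain tensor ops.

```python
import torch

routed_scaling_factor = 2.5
routed_out = torch.randn(4, 8)   # combined output of the routed experts
shared_out = torch.randn(4, 8)   # output of the shared experts

# Reference semantics: only the routed experts' output is scaled.
reference = routed_scaling_factor * routed_out + shared_out

# Fused path: the shared contribution is pre-divided by the factor, so the
# single multiply applied to the whole sum leaves it effectively unscaled.
fused = routed_scaling_factor * (routed_out + shared_out / routed_scaling_factor)

torch.testing.assert_close(reference, fused)
```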
Let me fix this after #8515 is merged.
Motivation
Issue: #8402
I also noticed a significant regression when running EPMoE during the recent GLM-4.5 support work: GSM8K accuracy drops from 0.965 to 0.745 when EPMoE is enabled. Accuracy is good for TP & DeepEP.
Modifications
There is a bug in DeepSeek-V2 and GLM-4.5 related to how `routed_scaling_factor` is applied in MoE (Mixture-of-Experts) layers.
Currently, `routed_scaling_factor` is applied in three different places depending on the code path (for example in the top-k selection, in the experts/kernel implementation, or in the model forward), leading to ambiguity.
This results in uncertain and inconsistent scaling, as the model layer has no visibility into whether `routed_scaling_factor` has already been applied upstream.
🎪 TL;DR: The model can't know whether it should apply `* routed_scaling_factor` or not, because the topk and the experts may or may not have already done it, depending on the code path.
My PR forces the model layer to apply `* routed_scaling_factor` when EPMoE is enabled, because in the current codebase EPMoE won't apply it by itself.
We still need to follow up and possibly refactor the sglang MoE codebase to make it clear which layer should apply `* routed_scaling_factor`; a possible shape is sketched below.
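One possible shape for that follow-up, along the lines of the property suggested in the review discussion above. This is only a sketch: every class and attribute name here (e.g. `apply_routed_scaling_factor_in_model`) is hypothetical and does not exist in sglang today.

```python
import torch


class EPMoESketch:
    """Hypothetical EPMoE-style experts: no fused scaling, so the model applies it."""

    apply_routed_scaling_factor_in_model = True

    def __call__(self, hidden_states: torch.Tensor, router_logits: torch.Tensor) -> torch.Tensor:
        return hidden_states  # stand-in for dispatch/combine across expert ranks


class FusedMoESketch:
    """Hypothetical fused backend: the factor is already folded into its kernels."""

    apply_routed_scaling_factor_in_model = False

    def __call__(self, hidden_states: torch.Tensor, router_logits: torch.Tensor) -> torch.Tensor:
        return hidden_states  # stand-in for a kernel that scales internally


def moe_forward(experts, hidden_states, router_logits, routed_scaling_factor):
    out = experts(hidden_states, router_logits)
    # One explicit decision point instead of scattered backend flag checks.
    if getattr(experts, "apply_routed_scaling_factor_in_model", True):
        out = out * routed_scaling_factor
    return out


h, r = torch.randn(2, 16), torch.randn(2, 8)
scaled = moe_forward(EPMoESketch(), h, r, routed_scaling_factor=2.5)      # scaled in model
unscaled = moe_forward(FusedMoESketch(), h, r, routed_scaling_factor=2.5)  # kernel owns scaling
```

With a property like this, the model forward would no longer need to inspect `_is_cuda`, `_use_aiter`, or `enable_ep_moe` to decide whether to scale.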
Accuracy Test
Benchmark & Profiling
Checklist