[NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE #8450
Conversation
Summary of Changes
Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request extends the system's Mixture-of-Experts (MoE) capabilities by integrating the Flashinfer block-scale FP8 backend for Tensor Parallel (TP) configurations. It introduces a new class and modifies the MoE dispatch and FP8 quantization logic to leverage this optimized kernel, decoupling its enablement from Expert Parallel (EP) MoE.
Highlights
- Flashinfer MoE Backend for TP MoE: This pull request enables the Flashinfer block-scale FP8 backend for Tensor Parallel (TP) Mixture-of-Experts (MoE) configurations, building upon previous work that enabled it for Expert Parallel (EP) MoE.
- New MoE Implementation Class: A new class, `FlashInferFusedMoE`, has been introduced to encapsulate the logic for Flashinfer-specific fused MoE operations; it is now conditionally selected by the system's MoE implementation dispatch logic.
- FP8 Quantization Integration: The FP8 quantization module has been updated with a dedicated `apply` method to prepare inputs (including per-token group quantization and weight reordering) and invoke Flashinfer's `trtllm_fp8_block_scale_moe` kernel for optimized computation.
- Decoupled MoE Enablement: The requirement for `enable_ep_moe` to be true when `enable_flashinfer_trtllm_moe` is active has been removed, allowing the Flashinfer TRTLLM MoE backend to be used independently in TP MoE scenarios.
Code Review
This pull request enables the Flashinfer MoE blockscale fp8 backend for Tensor Parallelism (TP) MoE, extending the functionality from a previous PR that handled Expert Parallelism (EP) MoE. The changes introduce a new `FlashInferFusedMoE` layer and adapt the FP8 quantization logic to support this new backend.
My review has identified a few issues:
- Critical Correctness Issue: A new `apply` method in `Fp8MoEMethod` overwrites an existing one, which will break the standard MoE path. This needs to be resolved by merging the two methods.
- Performance: An inefficient weight-swapping operation is performed on every forward pass. This should be moved to the weight-loading stage to avoid runtime overhead.
- Maintainability: I've flagged several debug `print` statements that should be removed before merging, along with a hardcoded magic number and local imports that reduce code clarity.
The overall direction is good, but the identified issues, especially the critical one, must be addressed to ensure correctness and performance.
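The review's two main points, merging the duplicate `apply` methods and hoisting the weight swap out of the forward path, could be addressed along these lines. This is a hypothetical sketch: `Fp8MoEMethodSketch`, `use_flashinfer_trtllm_moe`, `_apply_flashinfer`, `_apply_standard`, and the `[experts, 2*intermediate, hidden]` weight layout are illustrative names and assumptions, not the PR's actual code.

```python
import numpy as np

class Fp8MoEMethodSketch:
    """Illustrative merge of the two `apply` paths (not the PR's real class)."""

    def __init__(self, use_flashinfer_trtllm_moe: bool):
        self.use_flashinfer_trtllm_moe = use_flashinfer_trtllm_moe

    def process_weights_after_loading(self, layer: dict) -> None:
        # Reorder the gate/up halves of w13 ONCE at load time, rather than
        # on every forward pass, avoiding the per-step overhead the review
        # flagged. Assumes a [experts, 2*intermediate, hidden] layout.
        if self.use_flashinfer_trtllm_moe:
            w13 = layer["w13_weight"]
            half = w13.shape[-2] // 2
            layer["w13_weight"] = np.concatenate(
                [w13[..., half:, :], w13[..., :half, :]], axis=-2
            )

    def apply(self, layer: dict, x):
        # Single entry point dispatching on the backend flag, so the
        # standard MoE path is no longer shadowed by a second `apply`.
        if self.use_flashinfer_trtllm_moe:
            return self._apply_flashinfer(layer, x)
        return self._apply_standard(layer, x)

    def _apply_flashinfer(self, layer, x):
        return "flashinfer path"  # placeholder for the trtllm kernel call

    def _apply_standard(self, layer, x):
        return "standard path"  # placeholder for the existing fused path
```

Doing the reorder in a load-time hook also means the forward path sees weights in the layout the kernel expects, with no conditional copies in the hot loop.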
Force-pushed a83eb55 to ad0c778
Accuracy test results:
To repro:
E2E perf shows ~10% improvement for low-latency workloads:
To repro:
some nits
Force-pushed 2e79d5c to 42a8b20
follow-up tiny nit
Hi @kaixih, could you rebase? Thanks!
Force-pushed 6e7da6c to a432df0
Thanks for the heads-up. Rebased and conflicts resolved. @zhyncs
Note: this PR is also necessary for the previously merged PR (EP MoE + TRT-LLM kernel), since recent refactoring changes have broken it; the fix is included in this PR.
…-project#8450) Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>
A follow-up PR to enable the Flashinfer MoE blockscale fp8 backend for TP MoE.
The previous PR did the same for EP MoE.
cc @kushanam