
Conversation

@kaixih (Collaborator) commented Jul 28, 2025

A follow-up PR to enable the Flashinfer MoE block-scale FP8 backend for TP MoE.

The previous PR did the same, but for EP MoE.

cc. @kushanam

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request extends the system's Mixture-of-Experts (MoE) capabilities by integrating the Flashinfer block-scale FP8 backend for Tensor Parallel (TP) configurations. It introduces a new class and modifies the MoE dispatch and FP8 quantization logic to leverage this optimized kernel, decoupling its enablement from Expert Parallel (EP) MoE.

Highlights

  • Flashinfer MoE Backend for TP MoE: This pull request enables the Flashinfer block-scale FP8 backend for Tensor Parallel (TP) Mixture-of-Experts (MoE) configurations, building upon previous work that enabled it for Expert Parallel (EP) MoE.
  • New MoE Implementation Class: A new class, FlashInferFusedMoE, has been introduced to encapsulate the logic for Flashinfer-specific fused MoE operations, which is now conditionally selected by the system's MoE implementation dispatch logic.
  • FP8 Quantization Integration: The FP8 quantization module has been updated with a dedicated apply method to prepare inputs (including per-token group quantization and weight reordering) and invoke Flashinfer's trtllm_fp8_block_scale_moe kernel for optimized computation.
  • Decoupled MoE Enablement: The requirement for enable_ep_moe to be true when enable_flashinfer_trtllm_moe is active has been removed, allowing the Flashinfer TRTLLM MoE backend to be utilized independently for TP MoE scenarios.
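The dispatch change in the highlights can be sketched as follows. This is a hedged illustration, not the PR's actual code: only the names FlashInferFusedMoE and enable_flashinfer_trtllm_moe appear in this PR; the class bodies and the function get_moe_impl_class are hypothetical stand-ins.

```python
# Hypothetical sketch of the MoE implementation dispatch described above.
class FusedMoE:
    """Stand-in for the existing fused-MoE layer."""


class FlashInferFusedMoE(FusedMoE):
    """Stand-in for the new Flashinfer-backed layer (this PR's new class)."""


def get_moe_impl_class(enable_flashinfer_trtllm_moe: bool) -> type:
    # The PR drops the previous requirement that enable_ep_moe also be set,
    # so TP-only deployments can select this backend as well.
    if enable_flashinfer_trtllm_moe:
        return FlashInferFusedMoE
    return FusedMoE
```

The key point is that the flag alone now selects the Flashinfer path, independent of EP MoE.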

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables the Flashinfer MoE blockscale fp8 backend for Tensor Parallelism (TP) MoE, extending the functionality from a previous PR that handled Expert Parallelism (EP) MoE. The changes introduce a new FlashInferFusedMoE layer and adapt the FP8 quantization logic to support this new backend.

My review has identified a few issues:

  • Critical Correctness Issue: A new apply method in Fp8MoEMethod overwrites an existing one, which will break the standard MoE path. This needs to be resolved by merging the two methods.
  • Performance: An inefficient weight-swapping operation is performed on every forward pass. This should be moved to the weight-loading stage to avoid runtime overhead.
  • Maintainability: I've flagged several debug print statements that should be removed before merging, along with a hardcoded magic number and local imports that reduce code clarity.

The overall direction is good, but the identified issues, especially the critical one, must be addressed to ensure correctness and performance.
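The performance point above can be illustrated with a minimal sketch. All names below are hypothetical; the sketch only shows the general pattern the review asks for, namely doing the weight reorder once at weight-loading time instead of on every forward pass.

```python
# Hypothetical sketch: move a one-time weight reorder out of the hot path.
class MoEMethodSketch:
    def __init__(self, w13_weight):
        # w13_weight: per-expert weight blocks, stored as [w1..., w3...].
        self.w13_weight = w13_weight
        self._processed = False

    def process_weights_after_loading(self):
        # One-time reorder into the layout the kernel expects ([w3..., w1...]).
        if not self._processed:
            half = len(self.w13_weight) // 2
            self.w13_weight = self.w13_weight[half:] + self.w13_weight[:half]
            self._processed = True

    def apply(self, hidden_states):
        # Per-token work only; no weight shuffling on every call.
        assert self._processed, "call process_weights_after_loading() first"
        return hidden_states  # actual kernel invocation elided
```

Performing the swap once at load time removes the per-forward overhead flagged in the review.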

@kaixih kaixih force-pushed the kaixih/trtllm-gen-bs-fp8-tp branch 2 times, most recently from a83eb55 to ad0c778 Compare July 28, 2025 19:36
@kaixih (Collaborator, Author) commented Jul 28, 2025

Accuracy test results:

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9598 | ± 0.0054 |
| gsm8k | 3       | strict-match     | 5      | exact_match | 0.9560 | ± 0.0056 |

To repro:

```shell
export SGL_ENABLE_JIT_DEEPGEMM="0"
model_str=<path/to/DeepSeek-R1-0528>
lm_eval --model sglang \
    --model_args pretrained=$model_str,trust_remote_code=True,tp_size=8,max_model_len=32768,add_bos_token=True,enable_ep_moe=False,enable_flashinfer_trtllm_moe=True,disable_shared_experts_fusion=True \
    --tasks gsm8k \
    --batch_size 512 --num_fewshot 5
```

@kaixih (Collaborator, Author) commented Jul 28, 2025

End-to-end perf shows a ~10% improvement for low-latency workloads:

default:
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 2
Successful requests:                     32
Benchmark duration (s):                  216.25
Total input tokens:                      32768
Total generated tokens:                  32768
Total generated tokens (retokenized):    32714
Request throughput (req/s):              0.15
Input token throughput (tok/s):          151.53
Output token throughput (tok/s):         151.53
Total token throughput (tok/s):          303.06
Concurrency:                             2.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   13513.93
Median E2E Latency (ms):                 13511.06
---------------Time to First Token----------------
Mean TTFT (ms):                          111.85
Median TTFT (ms):                        107.33
P99 TTFT (ms):                           187.56
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.10
Median ITL (ms):                         13.10
P95 ITL (ms):                            13.38
P99 ITL (ms):                            13.50
Max ITL (ms):                            21.13
==================================================
flashinfer:
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 2
Successful requests:                     32
Benchmark duration (s):                  193.43
Total input tokens:                      32768
Total generated tokens:                  32768
Total generated tokens (retokenized):    32711
Request throughput (req/s):              0.17
Input token throughput (tok/s):          169.40
Output token throughput (tok/s):         169.40
Total token throughput (tok/s):          338.80
Concurrency:                             2.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   12088.34
Median E2E Latency (ms):                 12080.40
---------------Time to First Token----------------
Mean TTFT (ms):                          140.17
Median TTFT (ms):                        132.30
P99 TTFT (ms):                           213.18
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.68
Median ITL (ms):                         11.69
P95 ITL (ms):                            11.93
P99 ITL (ms):                            12.04
Max ITL (ms):                            56.19

To repro:

```shell
export SGL_ENABLE_JIT_DEEPGEMM="0"
model_dir=<path/to/DeepSeek-R1-0528>

if [[ "$1" == "server" ]]; then
  # Omit --enable-flashinfer-trtllm-moe to run the default backend.
  python3 -m sglang.launch_server --model-path $model_dir --trust-remote-code \
    --tp-size 8 --disable-shared-experts-fusion --enable-flashinfer-trtllm-moe
fi

if [[ "$1" == "client" ]]; then
  python3 -m sglang.bench_serving --backend sglang --model $model_dir \
    --num-prompts 32 --dataset-name random --random-input-len 1024 \
    --random-output-len 1024 --random-range-ratio 1 --max-concurrency=2 #--profile
fi
```
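As a sanity check, the ~10% figure quoted above follows directly from the mean E2E latencies in the two benchmark dumps:

```python
# Mean E2E latencies copied from the benchmark results above (ms).
default_e2e_ms = 13513.93
flashinfer_e2e_ms = 12088.34

improvement = (default_e2e_ms - flashinfer_e2e_ms) / default_e2e_ms
print(f"{improvement:.1%}")  # prints 10.5%
```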

@kaixih kaixih changed the title Draft: Enable Flashinfer MoE blockscale fp8 backend for TP MoE [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE Jul 28, 2025
@fzyzcjy (Collaborator) left a comment

some nits

@kaixih kaixih requested a review from kushanam as a code owner July 29, 2025 19:18
@kaixih kaixih force-pushed the kaixih/trtllm-gen-bs-fp8-tp branch from 2e79d5c to 42a8b20 Compare July 29, 2025 19:26
@fzyzcjy (Collaborator) left a comment

follow up tiny nit

@kaixih (Collaborator, Author) commented Jul 31, 2025

@fzyzcjy @zhyncs Anything else we can do on our end?

@zhyncs (Member) commented Jul 31, 2025

Hi @kaixih may you rebase? Thanks!

@kaixih kaixih force-pushed the kaixih/trtllm-gen-bs-fp8-tp branch from 6e7da6c to a432df0 Compare July 31, 2025 23:17
@kaixih (Collaborator, Author) commented Jul 31, 2025

Thanks for the heads-up. Rebased and conflicts resolved. @zhyncs

@kaixih (Collaborator, Author) commented Jul 31, 2025

Note: this PR is also necessary for this merged PR (EP MoE + TRT-LLM kernel), since recent refactoring changes have broken it. The fix is included in this PR.

@zhyncs zhyncs self-assigned this Jul 31, 2025

@kushanam kushanam merged commit aa4c66b into sgl-project:main Aug 1, 2025
98 of 117 checks passed

huangzhilin-hzl pushed a commit to huangzhilin-hzl/sglang that referenced this pull request Aug 1, 2025
…-project#8450)

Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>
TianQiLin666666 pushed a commit to TianQiLin666666/sglang that referenced this pull request Aug 1, 2025
…-project#8450)

Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>
lifuhuang pushed a commit that referenced this pull request Aug 3, 2025
Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>
ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025
Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>
ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025
Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
…-project#8450)

Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025
…-project#8450)

Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>