feat: integrate deepgemm into EPMoE #6821

xutizhou · 2025-06-03T02:18:56Z

Motivation

For normal EPMoE (no DeepEP), integrate DeepGEMM as an option.

This PR builds upon the work from PR #5805. Credit goes to @TianQiLin666666, who authored most of the code presented here.

Co-authored-by: @TianQiLin666666

Modifications

Performance

Two H20 nodes

node 1

python -m sglang.launch_server --model-path /path/to/DeepSeek-V3/    --trust-remote-code --tp-size 16 --enable-dp-attention --dp-size 16 --mem-fraction-static 0.8 --enable-ep-moe --disable-radix-cache --dist-init-addr 10.6.131.5:5000 --nnodes 2 --node-rank 0

node 2

python -m sglang.launch_server --model-path /path/to/DeepSeek-V3/    --trust-remote-code --tp-size 16 --enable-dp-attention --dp-size 16 --mem-fraction-static 0.8 --enable-ep-moe --disable-radix-cache  --dist-init-addr 10.6.131.5:5000 --nnodes 2 --node-rank 1

test

python3 -m sglang.bench_one_batch_server --model /path/to/DeepSeek-V3/ --base-url http://localhost:30000 --batch-size 1 16 32 64 128 --input-len 1024 --output-len 1024

Output throughput (tok/s)

Batch Size	w/o deepgemm	w/ deepgemm	Diff
1	24.95	30.49	+22.2%
16	309.32	335.31	+8.4%
32	553.30	595.04	+7.5%
64	1003.07	1088.89	+8.6%
128	1644.28	1786.10	+8.6%
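The Diff values can be re-derived from the reported throughput figures. The sketch below uses only the numbers from the table above; the helper name `relative_gain` and the dictionary layout are illustrative, not part of the PR.

```python
# Output throughput (tok/s) from the table above:
# batch size -> (without DeepGEMM, with DeepGEMM).
throughput = {
    1: (24.95, 30.49),
    16: (309.32, 335.31),
    32: (553.30, 595.04),
    64: (1003.07, 1088.89),
    128: (1644.28, 1786.10),
}


def relative_gain(baseline: float, candidate: float) -> float:
    """Return the relative change of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


# Round to one decimal place, matching the precision used in the table.
gains = {bs: round(relative_gain(*pair), 1) for bs, pair in throughput.items()}

for bs, gain in gains.items():
    print(f"batch size {bs}: {gain:+.1f}%")
```

The largest relative gain is at batch size 1; the larger batch sizes all land between roughly 7.5% and 8.6%.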

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

…for hidden states and enhance forward_normal method by capturing hidden states' shape, dtype, and device.

…pep_moe' to include 'enable_ep_moe' in DeepseekV2ForCausalLM. sgl-project#6767

…_deepgemm_preprocess' in kernels.py and layer.py for consistency.

gemini-code-assist

Hello @xutizhou, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Gemini-code-assist here, providing a summary of this pull request. This PR aims to integrate DeepGEMM as an optional execution path within the EPMoE layer, specifically for the standard (non-DeepEP) mode. It builds upon previous work and introduces a new forward method (forward_deepgemm) that leverages DeepGEMM's capabilities, likely for performance improvements using FP8 quantization and masked grouped GEMM operations. The changes involve adding new Triton kernels for the DeepGEMM flow and modifying the EPMoE layer to incorporate this new execution path.

Highlights

Integrate DeepGEMM: Adds an alternative forward path in the EPMoE layer (forward_deepgemm) that utilizes DeepGEMM for potentially improved performance.
New Triton Kernels: Introduces several new Triton kernels (compute_masked_m_triton_kernel, deepgemm_compute_src2dst_triton_kernel, fill_gateup_input_triton_kernel, deepgemm_post_reorder_triton_kernel) to handle DeepGEMM-specific preprocessing, input filling, and post-reordering steps.
FP8 Support: The DeepGEMM path is designed to work with FP8 quantization for inputs and weights, performing masked grouped GEMM operations.
Conditional Execution: The choice between the normal and DeepGEMM forward path is controlled by feature flags (use_deep_gemm and _ENABLE_JIT_DEEPGEMM).
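The conditional execution described above can be pictured with a toy class. This is a minimal sketch, not the real EPMoE layer: `TinyEPMoE`, its constructor arguments, and the tuple return values are hypothetical, and only the "both flags must be true" dispatch mirrors the summary.

```python
class TinyEPMoE:
    """Toy layer illustrating the two-flag dispatch only (hypothetical class)."""

    def __init__(self, use_deep_gemm: bool, jit_deepgemm_enabled: bool):
        # Storing the flags on the instance avoids relying on an
        # undefined free variable inside forward().
        self.use_deep_gemm = use_deep_gemm
        self.jit_deepgemm_enabled = jit_deepgemm_enabled

    def forward(self, hidden_states):
        # The DeepGEMM path is taken only when both flags are set.
        if self.use_deep_gemm and self.jit_deepgemm_enabled:
            return self.forward_deepgemm(hidden_states)
        return self.forward_normal(hidden_states)

    def forward_deepgemm(self, hidden_states):
        # Placeholder for the FP8 masked grouped GEMM path.
        return ("deepgemm", hidden_states)

    def forward_normal(self, hidden_states):
        # Placeholder for the existing path.
        return ("normal", hidden_states)


print(TinyEPMoE(True, True).forward([1.0])[0])
print(TinyEPMoE(True, False).forward([1.0])[0])
```

Keeping the flag as an instance attribute is one way to sidestep the undefined-name concern raised later in the review; whether the PR reads it from an attribute or an environment variable is not stated here.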

Changelog

python/sglang/srt/layers/moe/ep_moe/kernels.py
- Added compute_masked_m_triton_kernel for calculating expert-specific token counts.
- Added deepgemm_compute_src2dst_triton_kernel to map source token indices to destination indices within the reordered, grouped input.
- Added fill_gateup_input_triton_kernel to gather and fill the input tensor for the first grouped GEMM, including handling scales for quantization.
- Added deepgemm_post_reorder_triton_kernel to scatter and reorder the output from the second grouped GEMM back to the original token order, applying expert weights.
- Added exp2_upper helper function.
- Added moe_ep_deepgemm_preproess function to orchestrate the initial preprocessing steps for DeepGEMM.
python/sglang/srt/layers/moe/ep_moe/layer.py
- Imported new DeepGEMM-related kernels and functions.
- Added FP8 weight attributes (w13_weight_fp8, w2_weight_fp8) for use in the DeepGEMM path.
- Modified the main forward method to conditionally call forward_deepgemm or forward_normal based on flags.
- Implemented the forward_deepgemm method, which performs expert selection, calls the new preprocessing function, executes two masked grouped GEMM operations with FP8 inputs/weights, applies activation, and calls the new post-reordering kernel.
- Imported necessary flags and utilities (_ENABLE_JIT_DEEPGEMM, get_bool_env_var, is_cuda).
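To make the data flow in the changelog concrete, here is a pure-Python reference of the bookkeeping that the kernels are described as doing: counting tokens per local expert, mapping each (token, top-k slot) to a row in a padded per-expert buffer, and scattering weighted expert outputs back to token order. It is a sketch under assumptions: the function names, the `expert * m_max + position` layout, the `-1` marker for non-local experts, and the within-expert ordering are illustrative guesses, not the Triton kernels' actual behavior.

```python
def reference_masked_layout(topk_ids, start_expert_id, end_expert_id, m_max):
    """Return (masked_m, src2dst) for experts in [start_expert_id, end_expert_id].

    masked_m[e] is the number of slots routed to local expert e.
    src2dst[t][k] is the destination row e * m_max + position in a padded
    buffer, or -1 when the slot belongs to a non-local expert (assumed marker).
    """
    num_local = end_expert_id - start_expert_id + 1
    masked_m = [0] * num_local
    src2dst = [[-1] * len(row) for row in topk_ids]
    for t, row in enumerate(topk_ids):
        for k, expert in enumerate(row):
            if start_expert_id <= expert <= end_expert_id:
                local = expert - start_expert_id
                assert masked_m[local] < m_max, "padded buffer too small"
                src2dst[t][k] = local * m_max + masked_m[local]
                masked_m[local] += 1
    return masked_m, src2dst


def reference_post_reorder(expert_out, src2dst, topk_weights):
    """Scatter padded expert outputs back to token order, applying router weights."""
    hidden = len(expert_out[0])
    out = [[0.0] * hidden for _ in src2dst]
    for t, row in enumerate(src2dst):
        for k, dst in enumerate(row):
            if dst >= 0:  # skip slots handled by other ranks
                for h in range(hidden):
                    out[t][h] += topk_weights[t][k] * expert_out[dst][h]
    return out


# Three tokens, top-2 routing, this rank owns experts 0 and 1.
topk_ids = [[0, 1], [1, 2], [0, 3]]
topk_weights = [[0.5, 0.5], [1.0, 0.0], [0.25, 0.75]]
masked_m, src2dst = reference_masked_layout(topk_ids, 0, 1, m_max=4)

# Fake expert output: hidden size 1, each padded row holds its own row index.
expert_out = [[float(i)] for i in range(2 * 4)]
combined = reference_post_reorder(expert_out, src2dst, topk_weights)
print(masked_m, src2dst, combined)
```

A reference like this is mainly useful as a test oracle: a kernel's output can be compared against it on small inputs before benchmarking.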

MoE picks the best path,
DeepGEMM speeds up the math,
Kernels run so fast.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request integrates DeepGEMM into the EPMoE layer, providing an alternative computation path. The changes involve adding new Triton kernels for preprocessing and postprocessing specific to DeepGEMM, and modifying the EPMoE layer to conditionally use this new path. Overall, the integration seems to follow a logical flow for MoE operations.

I've identified a critical issue related to handling cases where block_shape is None in the preprocessing step, which could lead to an UnboundLocalError. Additionally, there are a few medium-severity points regarding an unused kernel parameter and an edge case in a utility function that would be good to address.

Please also ensure that comprehensive unit tests are added for this new DeepGEMM path, covering various configurations (including when block_shape is None and when it's provided). Documentation for the new kernels and the integration logic would also be highly beneficial for future maintainability.

Summary of Findings

Unused Kernel Parameter: The compute_masked_m_triton_kernel has an unused parameter N.
Edge Case in exp2_upper: The exp2_upper function returns num itself (not necessarily a power of 2) for inputs greater than 2^30, and returns 4 rather than the smallest enclosing power of 2 for inputs less than or equal to 2^1.
Critical: Potential UnboundLocalError: In moe_ep_deepgemm_preproess, if block_shape is None, scale and gateup_input_scale are not defined but are used later, leading to a potential UnboundLocalError and incorrect arguments to a Triton kernel.
Python Style (Minor): In deepgemm_post_reorder_triton_kernel, if computed == False: could be if not computed: for conventional Python style. This was not commented on due to severity settings.

Merge Readiness

This pull request introduces significant new functionality by integrating DeepGEMM. While the overall structure looks good, there is a critical issue in moe_ep_deepgemm_preproess concerning the handling of scale and gateup_input_scale when block_shape is None. This must be addressed before merging. Additionally, there are a couple of medium-severity suggestions for improving kernel parameters and utility function robustness.

I am unable to approve pull requests. Please ensure these changes are thoroughly tested and reviewed by other maintainers before merging.

gemini-code-assist · 2025-06-03T02:20:58Z

python/sglang/srt/layers/moe/ep_moe/kernels.py

+    if block_shape is not None:
+        assert len(block_shape) == 2
+        block_n, block_k = block_shape[0], block_shape[1]
+        hidden_states, scale = per_token_group_quant_fp8(hidden_states, block_k)
+
+        gateup_input_scale = torch.empty(
+            (gateup_input.size(0), gateup_input.size(1), scale.size(1)),
+            device=hidden_states.device,
+            dtype=scale.dtype,
+        )
+
+    fill_gateup_input_triton_kernel[(hidden_states.shape[0],)](
+        hidden_states,
+        scale,
+        gateup_input,
+        gateup_input_scale,
+        src2dst,
+        topk_ids,
+        start_expert_id,
+        end_expert_id,
+        top_k,
+        m_max,
+        hidden_states.size(1),
+        scale.size(1),
+        BLOCK_SIZE=1024,
+    )


There's a potential UnboundLocalError for scale and gateup_input_scale and incorrect arguments passed to fill_gateup_input_triton_kernel if block_shape is None.

If block_shape is None, scale (from per_token_group_quant_fp8) and gateup_input_scale are not defined within the if block_shape is not None: block (lines 1211-1220).

However, scale is unconditionally passed to fill_gateup_input_triton_kernel (line 1224), and gateup_input_scale is returned by the function (line 1244).

Additionally, hidden_states.size(1) and scale.size(1) are passed as hidden_size and scale_size to the kernel. If block_shape is None, hidden_states refers to the original unquantized tensor, but scale would be undefined, causing an error when scale.size(1) is accessed.

These variables need to be correctly defined and the appropriate (possibly quantized) hidden states and their sizes should be passed to the kernel regardless of the block_shape condition. Consider initializing scale and gateup_input_scale (e.g., to None or dummy tensors if the kernel requires them) and using a separate variable for the hidden states that are actually passed to the kernel.

    # Initialize variables that might be conditionally defined or modified
    kernel_passed_hidden_states = hidden_states
    kernel_passed_scale = None
    # gateup_input_scale is returned, so it must be defined.
    # Initialize to None or a default based on expected behavior when block_shape is None.
    # If the kernel always expects a tensor, a dummy tensor should be created here.
    # For now, let's assume it can be None if not used by the kernel in that path.
    # This needs careful verification against the kernel's expectations.
    returnable_gateup_input_scale = None

    # Determine sizes for the kernel call, these will be updated if quantization occurs
    final_kernel_hidden_size = hidden_states.size(1)
    final_kernel_scale_size = 0  # Default if no scale quantization

    if block_shape is not None:
        assert len(block_shape) == 2
        # block_n is not used in this part of the preprocessing for input quantization
        block_k = block_shape[1]
        quantized_hidden_states, scale_values = per_token_group_quant_fp8(hidden_states, block_k)
        kernel_passed_hidden_states = quantized_hidden_states
        kernel_passed_scale = scale_values

        returnable_gateup_input_scale = torch.empty(
            (gateup_input.size(0), gateup_input.size(1), kernel_passed_scale.size(1)),
            device=kernel_passed_hidden_states.device,  # Use device of the tensor being processed
            dtype=kernel_passed_scale.dtype,
        )
        final_kernel_hidden_size = kernel_passed_hidden_states.size(1)
        final_kernel_scale_size = kernel_passed_scale.size(1)
    else:
        # If block_shape is None, quantization via per_token_group_quant_fp8 is skipped.
        # The kernel fill_gateup_input_triton_kernel still expects scale_ptr and scale_size.
        # If the kernel requires valid tensors for scale and gateup_input_scale even when no
        # block_shape is provided (e.g., for a non-quantized path or different quant type),
        # dummy tensors should be created here. Otherwise, if it can handle None or scale_size=0,
        # current_scale = None and current_gateup_input_scale = None would be appropriate.
        # This example creates minimal dummy scales if the kernel cannot handle None.
        # This part needs to align with the Triton kernel's expectation for a no-quantization path.
        # For safety, providing dummy tensors if the kernel always reads them:
        kernel_passed_scale = torch.ones(
            (hidden_states.shape[0], 1), device=hidden_states.device, dtype=torch.float32
        )
        returnable_gateup_input_scale = torch.ones(
            (gateup_input.size(0), gateup_input.size(1), 1),
            device=hidden_states.device,
            dtype=torch.float32,
        )
        final_kernel_scale_size = 1  # For the dummy scale

    fill_gateup_input_triton_kernel[(kernel_passed_hidden_states.shape[0],)](
        kernel_passed_hidden_states,
        kernel_passed_scale,
        gateup_input,
        returnable_gateup_input_scale,
        src2dst,
        topk_ids,
        start_expert_id,
        end_expert_id,
        top_k,
        m_max,
        final_kernel_hidden_size,  # Size of kernel_passed_hidden_states
        final_kernel_scale_size,  # Size of kernel_passed_scale
        BLOCK_SIZE=1024,
    )

    return (
        m_max,
        masked_m[start_expert_id : (end_expert_id + 1)],
        expected_m,
        src2dst,
        gateup_input,
        returnable_gateup_input_scale,  # Return the (potentially dummy or None) gateup_input_scale
    )
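The failure mode described in this comment can be reproduced without Triton or PyTorch. The sketch below is a stripped-down illustration with hypothetical functions (`preprocess_buggy`, `preprocess_fixed`) and a list standing in for the scale tensor; it shows only the Python scoping issue, not the kernel call itself.

```python
def preprocess_buggy(values, block_shape=None):
    if block_shape is not None:
        scale = [abs(v) for v in values]
    # `scale` is bound only inside the branch above, so this line raises
    # UnboundLocalError when block_shape is None.
    return len(scale)


def preprocess_fixed(values, block_shape=None):
    scale = None  # bind the name on every code path
    if block_shape is not None:
        scale = [abs(v) for v in values]
    # Derive the size defensively instead of calling a method on None.
    return len(scale) if scale is not None else 0


try:
    preprocess_buggy([1.0, -2.0])
except UnboundLocalError as exc:
    print("buggy path failed:", type(exc).__name__)

print(preprocess_fixed([1.0, -2.0]))
print(preprocess_fixed([1.0, -2.0], block_shape=(128, 128)))
```

Whether `None`, a zero size, or a dummy tensor is the right fallback depends on what the consuming kernel reads, which is why the suggestion above hedges on that point.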

gemini-code-assist · 2025-06-03T02:20:58Z

python/sglang/srt/layers/moe/ep_moe/kernels.py

+
+
+@triton.jit
+def compute_masked_m_triton_kernel(seg_indptr, masked_m, num_experts, N):


The parameter N in the compute_masked_m_triton_kernel function signature appears to be unused within the kernel's body. If it's not required for any logic (e.g., boundary checks that might be missing or planned), could it be removed to simplify the signature and avoid confusion? The kernel is launched with a grid size of (num_experts,) and uses tl.program_id(0) to get expert_id, suggesting N might not be directly used for indexing in its current form.

gemini-code-assist · 2025-06-03T02:20:58Z

python/sglang/srt/layers/moe/ep_moe/kernels.py

+def exp2_upper(num: int) -> int:
+    for i in range(2, 31):
+        value = pow(2, i)
+        if num <= value:
+            return value
+    return num


The exp2_upper function calculates powers of 2 from 2^2 up to 2^30.

If num is less than or equal to 2^1 (e.g., 1 or 2), the loop starts at 2^2, so the function returns 4 rather than the smallest power of 2 greater than or equal to num (1 or 2, respectively).

If num is greater than 2^30, the function returns num, which wouldn't be a power of 2.

Is this behavior intended for these edge cases? If the goal is to always find the smallest power of 2 greater than or equal to num (for num > 0), an alternative like 1 << (num - 1).bit_length() could be considered. If the current behavior is correct, adding a comment to clarify these boundary conditions would be helpful.

    def exp2_upper(num: int) -> int:
        if num <= 0:
            # Or raise an error, or return a defined value like 1 or 2
            # depending on how non-positive inputs should be handled.
            # Assuming num is expected to be positive for this context.
            return 1  # Smallest power of 2, or handle error
        if num == 1:
            return 1  # 2^0, or 2 if strictly greater power of 2 is needed
        # Smallest power of 2 greater than or equal to num
        return 1 << (num - 1).bit_length()
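The boundary behavior is easiest to see by running both variants side by side. In this sketch `exp2_upper_original` is copied from the diff above, while `next_pow2` is an illustrative name for the `bit_length` alternative; treating inputs at or below 1 as mapping to 1 is an assumption.

```python
def exp2_upper_original(num: int) -> int:
    # Copied from the diff: checks 2^2 through 2^30 only.
    for i in range(2, 31):
        value = pow(2, i)
        if num <= value:
            return value
    return num


def next_pow2(num: int) -> int:
    # Smallest power of 2 >= num; inputs <= 1 map to 1 (assumed convention).
    if num <= 1:
        return 1
    return 1 << (num - 1).bit_length()


for n in (1, 2, 3, 5, 1024, 2**30 + 1):
    print(n, exp2_upper_original(n), next_pow2(n))
```

The two agree for inputs between 3 and 2^30. They differ at the low end, where the original never returns less than 4, and above 2^30, where the original returns its input unchanged.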

Copilot

Pull Request Overview

This PR integrates DeepGEMM as an optional path into the EPMoE layer, introducing new Triton kernels and branching logic.

Add imports and branching in forward to select between DeepGEMM and normal execution.
Implement forward_deepgemm with new preprocessing, GroupGEMM, activation, and post-reorder steps.
Introduce new Triton kernels and helper functions in kernels.py.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
python/sglang/srt/layers/moe/ep_moe/layer.py	Added DeepGEMM path, new imports, and `forward_deepgemm`
python/sglang/srt/layers/moe/ep_moe/kernels.py	Introduced DeepGEMM-specific Triton kernels and `moe_ep_deepgemm_preproess`

Comments suppressed due to low confidence (4)

python/sglang/srt/layers/moe/ep_moe/layer.py:255

The variable use_deep_gemm is not defined in this scope. You likely intended to read from a configuration or environment variable (e.g., self.use_deep_gemm or using get_bool_env_var).

if use_deep_gemm and _ENABLE_JIT_DEEPGEMM:

python/sglang/srt/layers/moe/ep_moe/layer.py:292

The function get_col_major_tma_aligned_tensor is not imported or defined in this file, causing a NameError at runtime. Please import it from the appropriate module.

get_col_major_tma_aligned_tensor(gateup_input_scale),

python/sglang/srt/layers/moe/ep_moe/layer.py:302

The function m_grouped_gemm_fp8_fp8_bf16_nt_masked is used but not imported; this will cause an undefined name error. Please import it where needed.

m_grouped_gemm_fp8_fp8_bf16_nt_masked(

python/sglang/srt/layers/moe/ep_moe/kernels.py:1222

Variable scale may be undefined when block_shape is None, leading to a runtime error. Ensure scale is initialized in all code paths before use.

fill_gateup_input_triton_kernel[(hidden_states.shape[0],)](

Copilot · 2025-06-03T02:21:34Z

python/sglang/srt/layers/moe/ep_moe/layer.py

    ep_gather,
    ep_scatter,
    gelu_and_mul_triton_kernel,
    grouped_gemm_triton,
+    moe_ep_deepgemm_preproess,


[nitpick] The function name moe_ep_deepgemm_preproess seems to have a typo (preproess); consider renaming it to moe_ep_deepgemm_preprocess for clarity.

Suggested change

-    moe_ep_deepgemm_preproess,
+    moe_ep_deepgemm_preprocess,

…ked grouped GEMM, ensuring shape M is a multiple of block M. Remove unused exp2_upper function.

…er, simplifying the function signature and its usage in moe_ep_deepgemm_preprocess.

…e this for-loop and reuse it

…kernel, workflow is controlled by arg 'dst_start'

TianQiLin666666 · 2025-06-16T15:26:00Z

Accuracy

gsm8k

python3 bench_sglang.py --num-questions 320 --parallel 16 --data-path /data/datasets/gsm8k/test.jsonl --port 8080

mmlu

python3 bench_sglang.py --nsub 50 --port 8080 --data_dir /data/datasets/mmlu

	MMLU	GSM8K
main branch	0.809	0.909
this PR	0.809	0.909

… by repeating scales

ch-wan · 2025-06-22T22:26:12Z

/gemini review

gemini-code-assist

Code Review

The code changes integrate DeepGEMM into the EPMoE layer, providing an option to use DeepGEMM for potentially improved performance. The changes include modifications to the kernels, the EPMoE layer, and some test files. The code introduces new Triton kernels for preprocessing steps required by DeepGEMM and modifies the forward pass of the EPMoE layer to conditionally use DeepGEMM based on a flag.

python/sglang/srt/layers/moe/ep_moe/kernels.py

Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com> Co-authored-by: TianQiLin666666 <1834987979@qq.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

This reverts commit 506c492.

Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com> Co-authored-by: TianQiLin666666 <1834987979@qq.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

v0.2.7 jit (sgl-project#7663) * [doc] update lws doc for pd (sgl-project#7318) * Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes (sgl-project#7648) * Add small requirements for benchmark/parse_result tools (sgl-project#7671) * [CPU] remove process_group from inputs of shm_allreduce and shm_allgather (sgl-project#7486) * chore: bump sgl-kernel v0.2.1 (sgl-project#7675) * support llama4 eagle3 (sgl-project#6985) Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Shenggui Li <somerlee.9@gmail.com> Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: yizhang2077 <1109276519@qq.com> * Refactor mm processors and Enable mixed modality processing (sgl-project#7629) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> * upgrade sgl kernel to 0.2.1 for main (sgl-project#7676) * add description for llama4 eagle3 (sgl-project#7688) * fix(model loader): use safe_open to prevent file handle leaks. (sgl-project#7684) * chore: upgrade flashinfer v0.2.7.post1 (sgl-project#7698) * Improve error handling for requests with unloaded LoRA path(s) (sgl-project#7642) * Apply dsv3_fused_a_gemm kernel (sgl-project#7635) * Fix GPTQMarlinMoE (sgl-project#7697) * [1/n] apply wna16marlin kernel in moe weight only quantization (sgl-project#7683) Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: yych0745 <1398089567@qq.com> Co-authored-by: HandH1998 <1335248067@qq.com> Co-authored-by: 弋云 <yiyun.wyt@antgroup.com> Co-authored-by: walker-ai <2398833647@qq.com> * Apply dsv3 router gemm kernel for deepseek-r1 fp4 (sgl-project#7677) * [AMD] Temporarily disable test_no_overlap_scheduler and test_vision_chunked_prefill (sgl-project#7717) * [RL] add --skip-warmup (sgl-project#7416) * [RL] support update_weights_from_distributed with different group and multiple weights (sgl-project#7292) * [router] add --log-level to sgl-router (sgl-project#6512) * [b200] support trt-llm allreduce fuse rms_norm_add kernel (sgl-project#7621) * [CPU] 
Bind threads and numa node for each TP rank (sgl-project#6549) Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com> * Support non-contiguous query input for extend/decode attention (sgl-project#7462) * Support updating weights at once by stopping all requests (sgl-project#6698) Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com> Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> * Fix num_tokens_pre_allocated in disaggregation log (sgl-project#7714) * [CPU] [sgl-kernel] set dispatch key of initialize to CatchAll (sgl-project#7734) * [CPU] fix all_reduce and all_gather (sgl-project#6770) Co-authored-by: blzheng <beilei.zheng@intel.com> * fix awq and dsv3 fused gemm compatible (sgl-project#7735) * [CI][Router] Fix bench_one_batch_server for pd router test (sgl-project#7731) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> * Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (sgl-project#7278) Co-authored-by: HydraQYH <QYH820@Outlook.com> Co-authored-by: TianQiLin666666 <1834987979@qq.com> * fix dsv3 fused proj check (sgl-project#7738) * Ascend attention backend(PA&MLA) (sgl-project#7722) Co-authored-by: Maksim <makcum888e@mail.ru> Co-authored-by: VDV1985 <vladdv85@mail.ru> * [fix] fix dsv3_router_gemm filter (sgl-project#7750) * [CPU] refine CPU integration code (sgl-project#7647) * [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (sgl-project#6771) * support qwen3 dense model dp attention (sgl-project#7681) * [optimize] add two stream norm for qwen3 (sgl-project#7740) Co-authored-by: ispobock <ispobaoke@gmail.com> * feat: use D2D instead of H2H in pp (sgl-project#7673) Co-authored-by: alpha-baby <fujianhao1997@qq.com> * [Bug] add flashinfer bool check for fusedmoe in Qwen moe models (sgl-project#7723) * [fix] put cpu in the first priority in get_device() (sgl-project#7752) * [optimize] fuse renormalize into moe_topk_softmax (sgl-project#7744) Co-authored-by: ispobock 
<ispobaoke@gmail.com> * chore: bump sgl-kernel 0.2.2 (sgl-project#7755) * fix CI: update native api ipynb (sgl-project#7754) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> * fuse renormal into moe topk softmax kernel python code (sgl-project#7751) Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: zhyncs <me@zhyncs.com> * Remove type conversion and fix id map in topk (sgl-project#7759) * Add V2-lite model test (sgl-project#7390) Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com> * refactor llama4 dp attention logic (sgl-project#7729) * fix(docs): fix the broken link in `docs/references/production_metrics.md` (sgl-project#7741) Signed-off-by: rudeigerc <rudeigerc@gmail.com> * [fix] update bench_speculative.py for compatibility (sgl-project#7764) Signed-off-by: Kay Yan <kay.yan@daocloud.io> * Move mem_fraction_static adjustment for multimodal models to `server_args.py` & Fix session control & Other cleanups (sgl-project#7748) * [RL] Add --nccl-port to prevent port conflict (sgl-project#7418) * [RL] add pause and continue generation for async rl training (sgl-project#7419) * [Fix] Alloc return type error (sgl-project#7778) Signed-off-by: Capronir <839972205@qq.com> * [feat] Support EAGLE3 for Qwen (sgl-project#7745) Co-authored-by: 纬杭 <ximing.wxm@antgroup.com> Co-authored-by: zyksir <zyksir@outlook.com> * saving hidden_states.clone() (sgl-project#7705) * [1/n]: add cutlass W4A8 moe kernel for hopper architecture (sgl-project#7772) Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com> Co-authored-by: yicwang <yichen.wang@bytedance.com> * add model: qwen2-audio (sgl-project#7596) * Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario (sgl-project#7782) * Embedding parallel by attn_tp (sgl-project#7623) * fix: fix apply_shuffle_mul_sum (sgl-project#7444) * chore: bump sgl-kernel v0.2.3 (sgl-project#7784) * fix: use nvidia-nccl-cu12 2.27.5 (sgl-project#7787) * DP Attention with Auto DeepEP Dispatch 
(sgl-project#7222) * chore: upgrade sgl-kernel v0.2.3 (sgl-project#7786) * Fix incorrect spec_num_draft_tokens in draft_extend (sgl-project#7757) * [fix] fix misusing of is_cuda (sgl-project#7790) * Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (sgl-project#7756) Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com> * chore: bump sgl-kernel v0.2.4 (sgl-project#7800) * ci: fix port args (sgl-project#7792) * Fix CI test OOM issue. (sgl-project#7799) * chore: upgrade sgl-kernel v0.2.4 (sgl-project#7801) * chore: bump v0.4.9 (sgl-project#7802) * fix merge conflict issue * fix hpu attention nonetyep issue * fix alignment * fix alignment2 * Ci failure fixes * fix attention-backend choices --------- Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com> Signed-off-by: huanglong <huanglong@linux.alibaba.com> Signed-off-by: Ata Fatahi <immrata@gmail.com> Signed-off-by: keru <rukeyang@gmail.com> Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com> Signed-off-by: rudeigerc <rudeigerc@gmail.com> Signed-off-by: Kay Yan <kay.yan@daocloud.io> Signed-off-by: Capronir <839972205@qq.com> Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com> Signed-off-by: Mohit Sinha <msinha@habana.ai> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: KavioYu <67678385+yukavio@users.noreply.github.com> Co-authored-by: kavioyu <kavioyu@tencent.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com> Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com> Co-authored-by: wunhuang <wunhuang@amd.com> Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com> Co-authored-by: u4lr451 <u4lr451@gmail.com> Co-authored-by: austindeng <austindeng@tencent.com> Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com> 
Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: ch-wan <cwan39@gatech.edu> Co-authored-by: Yijie Zhu <762412795@qq.com> Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com> Co-authored-by: Charles Chen <pychen96@gmail.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: YanbingJiang <yanbing.jiang@intel.com> Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com> Co-authored-by: jianan-gu <jianan.gu@intel.com> Co-authored-by: sdp <sdp@gnr799219.jf.intel.com> Co-authored-by: Binyao Jiang <byjiang1996@gmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: linzhuo <15313137931lz@gmail.com> Co-authored-by: ch-tiger1 <tiger@ch-tech.ip-ddns.com> Co-authored-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: Simo Lin <linsimo.mark@gmail.com> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com> Co-authored-by: Atream <80757050+Atream@users.noreply.github.com> Co-authored-by: Li Hui <lambert80.ios@gmail.com> Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com> Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com> Co-authored-by: Ata Fatahi <immrata@gmail.com> Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com> Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> Co-authored-by: Wenbo Yang <solrex@users.noreply.github.com> Co-authored-by: Chang Su <csu272@usc.edu> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Keyang Ru 
<rukeyang@gmail.com> Co-authored-by: ehuaa <ehuamail@163.com> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Lifu Huang <lifu.hlf@gmail.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: alcanderian <alcanderian@gmail.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Sai Enduri <saimanas.enduri@amd.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: xutizhou <xutingz@nvidia.com> Co-authored-by: TianQiLin666666 <1834987979@qq.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Yuhong Guo <guoyuhong1985@outlook.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: Alex Sun <alex.s@amd.com> Co-authored-by: valarLip <103567126+valarLip@users.noreply.github.com> Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: xianzhiT <xianzhitang@tencent.com> Co-authored-by: yilian49 <43861414+yilian49@users.noreply.github.com> Co-authored-by: DangKai <dangkai4u@outlook.com> Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: ll819214 <18801269230@163.com> Co-authored-by: Li Junwen <lijunwen13@hisilicon.com> Co-authored-by: zixuanzhang226 <zixuanzhang@bytedance.com> Co-authored-by: Hongbo Xu <1320612015@qq.com> Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> Co-authored-by: Meng, Peng <pengmeng@tencent.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: tarinkk 
<129432511+tarinkk@users.noreply.github.com> Co-authored-by: tarinkk <rt572@physics.rutger.edu> Co-authored-by: tarinkk <rt572@rutgers.physics.edu> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com> Co-authored-by: Sheng Qi <shengqi2018@pku.edu.cn> Co-authored-by: finetune <82650881+finetunej@users.noreply.github.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: Kan Wu <wukanustc@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: narutolhy <582909902@qq.com> Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Shenggui Li <somerlee.9@gmail.com> Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: Simon_CQK <cqk0100@gmail.com> Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: yych0745 <1398089567@qq.com> Co-authored-by: HandH1998 <1335248067@qq.com> Co-authored-by: 弋云 <yiyun.wyt@antgroup.com> Co-authored-by: walker-ai <2398833647@qq.com> Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com> Co-authored-by: Albert <albert.zty@antgroup.com> Co-authored-by: Ziming Huang <1520787127@qq.com> Co-authored-by: ayrnb <70835312+ayrnb@users.noreply.github.com> Co-authored-by: HydraQYH <QYH820@Outlook.com> Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: Maksim <makcum888e@mail.ru> Co-authored-by: VDV1985 <vladdv85@mail.ru> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: TianyuZhang1214 <tianyuzhang1214@163.com> Co-authored-by: alpha-baby <fujianhao1997@qq.com> Co-authored-by: Yuchen Cheng <rudeigerc@gmail.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: Caproni <40862361+Capronir@users.noreply.github.com> 
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com> Co-authored-by: 纬杭 <ximing.wxm@antgroup.com> Co-authored-by: zyksir <zyksir@outlook.com> Co-authored-by: SijiaYang <yangsijia.614@bytedance.com> Co-authored-by: yicwang <yichen.wang@bytedance.com> Co-authored-by: Leng Yue <lengyue@lengyue.me> Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com> Co-authored-by: Gang Chen <13298548+MoonBall@users.noreply.github.com> Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com> Co-authored-by: jay <jthakur@habana.ai>

Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com> Co-authored-by: TianQiLin666666 <1834987979@qq.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

TianQiLin666666 and others added 14 commits April 22, 2025 16:50

feat(ep_moe): integrate deepgemm into origin ep moe

92d647c

fix(ep_moe): group_gemm_mask bug

e057acb

fix bugs

19ec50e

fix bugs

3ce1a91

fix(em_moe): offset bugs

3d51a71

fix(deepgemm): bugfix

c80fc3c

fix: remove redundant code

af94a8b

fix: clang-format

2022070

fix: remove print

55ea483

fix(ep_moe): replace EPMOE_USE_DEEPGEMM with _ENABLE_JIT_DEEPGEMM

988a522

merge main

1f81f01

Refactor moe_ep_deepgemm_preprocess to remove CUDA-specific handling for hidden states and enhance forward_normal method by capturing hidden states' shape, dtype, and device.

b4ae984

Fix condition for expert fusion by updating the check for 'enable_deepep_moe' to include 'enable_ep_moe' in DeepseekV2ForCausalLM. sgl-project#6767

c6d51d2

Fix typo in function name from 'moe_ep_deepgemm_preproess' to 'moe_ep_deepgemm_preprocess' in kernels.py and layer.py for consistency.

0397c25

xutizhou requested review from merrymercy, Ying1123, zhyncs, ispobock, HaiShaw, ch-wan and BBuf as code owners June 3, 2025 02:18

xutizhou requested a review from Copilot June 3, 2025 02:19

gemini-code-assist bot reviewed Jun 3, 2025

gemini-code-assist bot suggested changes Jun 3, 2025

Copilot AI reviewed Jun 3, 2025

Update moe_ep_deepgemm_preprocess to adjust m_max calculation for masked grouped GEMM, ensuring shape M is a multiple of block M. Remove unused exp2_upper function.

aeb437d
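The m_max adjustment in this commit amounts to rounding a row count up to the next multiple of the GEMM's M tile. A minimal sketch follows, assuming a hypothetical helper `round_up_to_block` and an illustrative block size of 128; the block M actually used by the PR is not stated on this page.

```python
def round_up_to_block(m: int, block_m: int = 128) -> int:
    """Round m up to the nearest multiple of block_m (hypothetical helper)."""
    if m < 0 or block_m <= 0:
        raise ValueError("m must be non-negative and block_m must be positive")
    # Ceiling division followed by multiplication gives the padded size.
    return (m + block_m - 1) // block_m * block_m


# Illustrative value only: 300 rows are padded up to the next 128-row tile boundary.
m_max = round_up_to_block(300)
```

Padding this way keeps every expert's slice aligned to whole tiles; the rows beyond the real token count would be ignored by a masked kernel.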

xutizhou requested review from hnyls2002, ByronHsu and zhaochenyang20 as code owners June 3, 2025 04:00

Refactor compute_masked_m_triton_kernel to remove num_experts parameter, simplifying the function signature and its usage in moe_ep_deepgemm_preprocess.

d2c19bf

TianQiLin666666 added 5 commits June 16, 2025 17:09

assert act=="silu" in epmoe forward_deepgemm

b51f324

fix(epmoe): remove _ENABLE_JIT_DEEPGEMM

ad9abb2

fix(fill_gateup_input_triton_kernel): pre-define a tl.arange() outside this for-loop and reuse it

1869b18

replace deepgemm_post_reorder_triton_kernel with post_reorder_triton_kernel, workflow is controlled by arg 'dst_start'

77123c6

fix args of post_reorder_triton_kernel in all tests and benchmarks

ddc0c33

TianQiLin666666 requested review from HandH1998, yizhang2077 and FlamingoPg as code owners June 16, 2025 15:12

ch-wan and others added 7 commits June 17, 2025 00:56

Merge branch 'main' into feat/ep_moe_deepgemm

8e71771

Merge branch 'main' into feat/ep_moe_deepgemm

a6d61f6

Merge branch 'main' into feat/ep_moe_deepgemm

e30b3ab

add num_fused_shared_experts

aafbe4e

fix(moe_deepgemm): convert per-tensor weight quant to per-block quant by repeating scales

0ea5bd9
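Converting a per-tensor scale to per-block form by repetition can be pictured as filling a grid of block scales with the same value. The sketch below is a plain-Python illustration, not the PR's code: the helper name `expand_per_tensor_scale` is hypothetical, and the 128×128 block size is an assumption.

```python
import math
from typing import List


def expand_per_tensor_scale(scale: float, n: int, k: int, block: int = 128) -> List[List[float]]:
    """Build a ceil(n/block) x ceil(k/block) grid where every block reuses `scale`."""
    rows = math.ceil(n / block)
    cols = math.ceil(k / block)
    # Every block shares the single per-tensor scale, so dequantized values are unchanged.
    return [[scale] * cols for _ in range(rows)]


# Illustrative shapes only: a 256 x 384 weight yields a 2 x 3 grid of identical scales.
block_scales = expand_per_tensor_scale(0.5, n=256, k=384)
```

Because each block carries the same scale as the original tensor, the conversion changes only the layout a block-scaled kernel expects, not the numerical result.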

Merge branch 'main' into feat/ep_moe_deepgemm

87d68e9

Merge branch 'main' into feat/ep_moe_deepgemm

26040f4

ch-wan mentioned this pull request Jun 21, 2025

[Bug] Deepseek EP + DP Fail and Accuracy Crush #7041

Closed

5 tasks

ch-wan approved these changes Jun 22, 2025

gemini-code-assist bot reviewed Jun 22, 2025

Merge branch 'main' into feat/ep_moe_deepgemm

bbb127d

zhyncs merged commit 506c492 into sgl-project:main Jun 23, 2025
42 of 59 checks passed

lifuhuang added a commit that referenced this pull request Jun 27, 2025

Revert "feat: integrate deepgemm into EPMoE (#6821)"

572f052

This reverts commit 506c492.

lifuhuang mentioned this pull request Jun 27, 2025

[Bug] [CI regression] TestEpMoEFP8 #7586

Closed

5 tasks


		@triton.jit
		def compute_masked_m_triton_kernel(seg_indptr, masked_m, num_experts, N):
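Judging by its argument names, this kernel appears to derive a per-expert row count (`masked_m`) from a CSR-style segment pointer (`seg_indptr`). That reading is an assumption; the kernel body is not shown on this page. A plain-Python reference of that computation, with the hypothetical name `compute_masked_m_reference`, might look like this:

```python
from typing import List


def compute_masked_m_reference(seg_indptr: List[int]) -> List[int]:
    """Return the number of rows per expert from a segment pointer array.

    Assumption: seg_indptr[e] and seg_indptr[e + 1] delimit the rows routed to expert e.
    """
    return [seg_indptr[e + 1] - seg_indptr[e] for e in range(len(seg_indptr) - 1)]


# Illustrative input only: three experts receiving 3, 0 and 4 rows respectively.
masked_m = compute_masked_m_reference([0, 3, 3, 7])
```

In this sketch the expert count is implied by the length of `seg_indptr`, so a separate `num_experts` argument is not needed.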

Conversation

xutizhou commented Jun 3, 2025 • edited

Motivation

Modifications

Performance

Checklist

gemini-code-assist bot left a comment

Summary of Changes

Highlights

Changelog

Footnotes

gemini-code-assist bot left a comment

Code Review

Summary of Findings

Merge Readiness

gemini-code-assist bot Jun 3, 2025

gemini-code-assist bot Jun 3, 2025

gemini-code-assist bot Jun 3, 2025

Copilot AI left a comment

Pull Request Overview

Reviewed Changes

Copilot AI Jun 3, 2025

TianQiLin666666 commented Jun 16, 2025

Accuracy

ch-wan commented Jun 22, 2025

gemini-code-assist bot left a comment

Code Review

xutizhou commented Jun 3, 2025 • edited