Contributor

@alexsun07 alexsun07 commented Jun 17, 2025

Motivation

Improve MoE performance in the DeepEP path.

Modifications

Add the aiter fused MoE kernel to the DeepEP path.

Checklist

Accuracy and performance

without aiter fused_moe:

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000
100%|██████████████████████████████████████████████████████████████| 1319/1319 [05:20<00:00,  4.12it/s]
Accuracy: 0.946
Invalid: 0.000
Latency: 321.660 s
Output throughput: 416.608 token/s

with aiter fused_moe:

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 
100%|██████████████████████████████████████████████████████████████| 1319/1319 [02:13<00:00,  9.90it/s]
Accuracy: 0.945
Invalid: 0.000
Latency: 135.440 s
Output throughput: 990.551 token/s

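Output throughput rises from 416.608 to 990.551 token/s (about 2.4x) and latency drops from 321.660 s to 135.440 s, while accuracy is essentially unchanged (0.946 vs. 0.945).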
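The comparison can be checked directly from the two runs. This is only a small arithmetic sketch over the figures reported above; the dictionary layout and variable names are illustrative, not part of the benchmark script.

```python
# Figures copied from the two gsm8k runs reported above.
baseline = {"accuracy": 0.946, "latency_s": 321.660, "throughput_tok_s": 416.608}
with_aiter = {"accuracy": 0.945, "latency_s": 135.440, "throughput_tok_s": 990.551}

# Ratios are written so that values above 1.0 mean "aiter is better".
throughput_gain = with_aiter["throughput_tok_s"] / baseline["throughput_tok_s"]
latency_ratio = baseline["latency_s"] / with_aiter["latency_s"]
accuracy_delta = with_aiter["accuracy"] - baseline["accuracy"]

print(f"throughput gain: {throughput_gain:.2f}x")
print(f"latency ratio:   {latency_ratio:.2f}x")
print(f"accuracy delta:  {accuracy_delta:+.3f}")
```

Both ratios come out just under 2.4x, and the accuracy difference is one part in a thousand, which is within what a single run would be expected to vary by.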

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @alexsun07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance optimization for Mixture-of-Experts (MoE) layers within the DeepEP execution path, specifically targeting AMD GPUs. It integrates the aiter.fused_moe kernel, which provides a highly optimized implementation for MoE operations, leading to substantial improvements in inference throughput and reduced latency. The changes involve adapting the MoE layer's weight handling and forward pass, as well as adjusting the token dispatcher to leverage the fused kernel's internal permutation capabilities.

Highlights

  • Performance Optimization: Integrates aiter.fused_moe for Mixture-of-Experts (MoE) layers on AMD (HIP) devices, significantly improving performance in the DeepEP path. Benchmarks show a roughly 2.4x increase in output throughput (from 416.608 token/s to 990.551 token/s) and a substantial reduction in latency (from 321.660 s to 135.440 s) with comparable accuracy.
  • Conditional Execution: Introduces a new SGLANG_USE_AITER environment variable to enable or disable the aiter integration, ensuring backward compatibility and flexibility for different hardware configurations.
  • MoE Layer Adaptations: Modifies the MoE layer's weight processing and forward pass to leverage aiter's optimized fused MoE kernel. This includes conditional weight shuffling (shuffle_weight) for w13_weight and w2_weight, and a dedicated forward_aiter method that wraps the aiter.fused_moe call.
  • Token Dispatcher Adjustments: Updates the token dispatcher to bypass its internal permutation logic (_deepep_permute) when aiter is active, as aiter.fused_moe handles token reordering internally for improved efficiency.
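The dispatcher change in the last highlight can be pictured with a toy sketch. Everything here is hypothetical (the `dispatch` function, its arguments, and plain lists standing in for tensors); it only illustrates skipping the group-by-expert permutation when a fused kernel is expected to reorder tokens itself, and it is not the sglang or aiter implementation.

```python
from typing import List, Tuple


def dispatch(
    tokens: List[str], expert_ids: List[int], use_aiter: bool
) -> Tuple[List[str], List[int]]:
    """Toy stand-in for a token dispatcher (hypothetical, not sglang code)."""
    if use_aiter:
        # Fused-kernel path: hand tokens over in arrival order and leave
        # grouping by expert to the kernel.
        return list(tokens), list(expert_ids)
    # Default path: group tokens by expert before the per-expert GEMMs.
    # sorted() is stable, so tokens routed to the same expert keep their order.
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    return [tokens[i] for i in order], [expert_ids[i] for i in order]


tokens = ["t0", "t1", "t2", "t3"]
expert_ids = [2, 0, 2, 1]
print(dispatch(tokens, expert_ids, use_aiter=False))
print(dispatch(tokens, expert_ids, use_aiter=True))
```

The point of the bypass is that permuting twice (once in the dispatcher, once inside the kernel) would be wasted work.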

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully integrates AITer's fused MoE kernels into the DeepEP path for AMD GPUs, demonstrating significant performance improvements as per the provided benchmarks. The changes are well-scoped and primarily involve conditional logic to enable AITer-specific paths for weight processing, MoE forward pass, and token dispatching.

The core logic for AITer integration, including weight shuffling, expert mask usage, and the new forward_aiter method, appears correct and aligns with typical patterns for using such optimized kernels.

One area for improvement is the duplicated definition of the _use_aiter flag across two files. Centralizing this would enhance maintainability. Additionally, a minor consideration for naming the shuffle dimensions as constants could improve code clarity if these values are not strictly fixed internal details of the AITer kernel.

Overall, the PR is a valuable performance enhancement.
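The review's suggestion to centralize the `_use_aiter` flag could look roughly like the sketch below. The helper names `env_flag` and `use_aiter`, the accepted truthy strings, and the `is_hip` argument are assumptions made for illustration; only the `SGLANG_USE_AITER` variable name comes from this PR.

```python
import os


def env_flag(name: str, default: str = "false") -> bool:
    """Parse a boolean-like environment variable (hypothetical helper)."""
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes")


def use_aiter(is_hip: bool) -> bool:
    """Single place that decides whether the aiter path is active.

    `is_hip` is passed in by the caller; how the platform is detected is
    outside the scope of this sketch.
    """
    return is_hip and env_flag("SGLANG_USE_AITER")


# Example: modules would import use_aiter() instead of re-deriving the flag.
os.environ["SGLANG_USE_AITER"] = "1"
print(use_aiter(is_hip=True))
print(use_aiter(is_hip=False))
```

Keeping the check in one function means a later change to the enabling condition touches one file rather than every module that consults the flag.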

@HaiShaw HaiShaw self-assigned this Jun 18, 2025
Collaborator

HaiShaw commented Jun 19, 2025

@alexsun07 please provide the full server launch commands for a repro.

Contributor Author

alexsun07 commented Jun 21, 2025

> @alexsun07 please provide the full server launch commands for a repro.

Sure! To enable aiter fused_moe for EP, please set the environment variable SGLANG_USE_AITER=1.

My launch script:

export SGLANG_USE_AITER=1

python3 -m sglang.launch_server \
    --trust-remote-code \
    --chunked-prefill-size 131072 \
    --attention-backend aiter \
    --tp-size 8 \
    --enable-deepep-moe \
    --deepep-mode normal \
    --model deepseek-ai/DeepSeek-V3

Collaborator

@HaiShaw HaiShaw left a comment

LGTM

@HaiShaw HaiShaw enabled auto-merge (squash) June 24, 2025 09:00
@HaiShaw HaiShaw merged commit 755f314 into sgl-project:main Jun 24, 2025
5 of 48 checks passed
whybeyoung pushed a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025
yilian49 pushed a commit to yilian49/sglang that referenced this pull request Jun 24, 2025
whybeyoung pushed a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025

zyeric commented Jul 2, 2025

> > @alexsun07 please provide the full server launch commands for a repro.
>
> Sure! To enable aiter fused_moe for EP, please set the environment variable SGLANG_USE_AITER=1.
>
> My launch script:
>
> export SGLANG_USE_AITER=1
>
> python3 -m sglang.launch_server \
>     --trust-remote-code \
>     --chunked-prefill-size 131072 \
>     --attention-backend aiter \
>     --tp-size 8 \
>     --enable-deepep-moe \
>     --deepep-mode normal \
>     --model deepseek-ai/DeepSeek-V3

I used a similar command to launch a server for Qwen3-30B-A3B in the 0.4.8.post1 ROCm Docker image, but it throws the following error:

File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2_moe.py", line 416, in <lambda>
    lambda idx, prefix: decoder_layer_type(
                        ^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_moe.py", line 539, in __init__
    self.mlp = Qwen3MoeSparseMoeBlock(
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_moe.py", line 139, in __init__
    self.deepep_dispatcher = MaybeTboDeepEPDispatcher(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/two_batch_overlap.py", line 618, in __init__
    DeepEPDispatcher(**kwargs) for _ in range(num_inner_dispatchers)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 668, in __init__
    self._normal_dispatcher = _DeepEPDispatcherImplNormal(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 235, in __init__
    super().__init__(**kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/token_dispatcher.py", line 186, in __init__
    raise ImportError(
ImportError: DeepEP is not installed. Please install DeepEP package from https://github.com/deepseek-ai/deepep.

Could you help me fix this error? Many thanks :)

zyeric commented Jul 3, 2025

I also tried launching the model with plain TP and aiter:

export SGLANG_USE_AITER=1

python3 -m sglang.launch_server \
    --trust-remote-code \
    --chunked-prefill-size 131072 \
    --attention-backend aiter \
    --tp-size 8 \
    --model Qwen/Qwen3-30B-A3B

It throws the following error:

File "/sgl-workspace/aiter/aiter/fused_moe.py", line 476, in fused_moe_2stages
    a2 = stage1(
         ^^^^^^^
  File "/sgl-workspace/aiter/aiter/ops/moe_op.py", line 313, in ck_moe_stage1_fwd
    ck_moe_stage1(
  File "/sgl-workspace/aiter/aiter/jit/core.py", line 546, in wrapper
    module = get_module(md_name)
             ^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/aiter/aiter/jit/core.py", line 218, in get_module
    __mds[md_name] = importlib.import_module(f"{__package__}.{md_name}")
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level) 
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1324, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'aiter.jit.module_moe_ck2stages_b16_b16_b16_silu_no_mulWeightStage2'

I think it may be a bug since the compiled kernel is located at /home/aiscuser/.aiter/jit/build/lock_module_moe_ck2stages_b16_b16_b16_silu_no_mulWeightStage2 according to the log.

After that, I uninstalled aiter in the Docker container and reinstalled it in user space. This avoids the error above but leads to a new one:

  File "/sgl-workspace/sglang/python/sglang/srt/custom_op.py", line 56, in forward
    return self._forward_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/custom_op.py", line 68, in forward_hip
    return self.forward_cuda(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 222, in forward_cuda
    return fused_moe(
           ^^^^^^^^^^
  File "/scratch/nishang/aiter/aiter/fused_moe.py", line 154, in fused_moe
    return fused_moe_2stages(
           ^^^^^^^^^^^^^^^^^^
  File "/scratch/nishang/aiter/aiter/fused_moe.py", line 476, in fused_moe_2stages
    a2 = stage1(
         ^^^^^^^
  File "/scratch/nishang/aiter/aiter/ops/moe_op.py", line 313, in ck_moe_stage1_fwd
    ck_moe_stage1(
  File "/scratch/nishang/aiter/aiter/jit/core.py", line 621, in wrapper
    return op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
RuntimeError: wrong! device_gemm with the specified compilation parameters does not support this GEMM problem

zyeric commented Jul 3, 2025

It seems something is wrong with the aiter backend: the same server launches successfully with tp=2.
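One way to start narrowing this down is to compare the per-rank expert shard sizes at different TP degrees. The sketch below assumes the MoE intermediate dimension is split evenly across TP ranks and uses 768 as the per-expert intermediate size, a value taken as an assumption from the public Qwen3-30B-A3B config. Neither assumption is stated in this thread, and the thread does not establish that shard size is the cause of the `device_gemm` error.

```python
def per_rank_intermediate(moe_intermediate_size: int, tp_size: int) -> int:
    """Per-rank slice of the expert intermediate dim under an even TP split."""
    if moe_intermediate_size % tp_size != 0:
        raise ValueError("intermediate size is not divisible by tp_size")
    return moe_intermediate_size // tp_size


# Assumption: per-expert intermediate size for Qwen3-30B-A3B; verify against
# the model's config.json before relying on it.
MOE_INTERMEDIATE_SIZE = 768

for tp in (2, 4, 8):
    print(f"tp={tp}: {per_rank_intermediate(MOE_INTERMEDIATE_SIZE, tp)} columns per rank")
```

Whether a given shard size is supported depends on the tile configurations the kernel was compiled with, so the numbers are a starting point for an issue report rather than a diagnosis.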

zyeric commented Jul 3, 2025

By the way, similar to the issue and bugfix, the code here should be changed to

layer.w13_weight.data = shuffle_weight(layer.w13_weight.data, (16, 16))
torch.cuda.empty_cache()
layer.w2_weight.data = shuffle_weight(layer.w2_weight.data, (16, 16))
torch.cuda.empty_cache()

to enable RL training
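Why assigning through `.data` matters can be shown with a toy stand-in. `FakeParam`, `FakeLayer`, `fake_shuffle`, and the `weight_loader` attribute are all hypothetical, and the sketch assumes the code being replaced rebinds the layer attribute to a freshly constructed parameter, which the thread implies but does not show.

```python
class FakeParam:
    """Minimal stand-in for a framework parameter object (hypothetical)."""

    def __init__(self, data):
        self.data = data


class FakeLayer:
    """Empty container that attributes are attached to."""


def fake_shuffle(data):
    """Placeholder for a layout shuffle; here it just reverses the list."""
    return list(reversed(data))


def make_layer():
    layer = FakeLayer()
    layer.w13_weight = FakeParam([1, 2, 3])
    # Extra attribute attached after construction, e.g. by a weight loader.
    layer.w13_weight.weight_loader = "custom_loader"
    return layer


# Rebinding: the new object does not carry the extra attribute.
rebound = make_layer()
rebound.w13_weight = FakeParam(fake_shuffle(rebound.w13_weight.data))
print(hasattr(rebound.w13_weight, "weight_loader"))

# In-place: same object, new contents, attribute preserved.
in_place = make_layer()
in_place.w13_weight.data = fake_shuffle(in_place.w13_weight.data)
print(hasattr(in_place.w13_weight, "weight_loader"))
```

A training loop that later reloads weights through attributes attached to the parameter would only find them in the second case.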

@alexsun07
Contributor Author

Hi @zyeric, thanks for reporting these issues.

  1. DeepEP needs to be installed for this path. We haven't released an official Docker image with DeepEP yet, but one may come soon. Please stay tuned.
  2. This PR is an integration. Would you please raise a separate issue for the problems you hit with normal TP and RL training?
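Regarding the first point, a quick preflight check before launching with `--enable-deepep-moe` might look like the sketch below. The helper name `has_module` is hypothetical, and `deep_ep` is assumed to be the import name of the DeepEP package; adjust it if your installation differs.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumption: "deep_ep" is the import name of the DeepEP package.
if has_module("deep_ep"):
    print("DeepEP found.")
else:
    print("DeepEP not found; launching with --enable-deepep-moe is expected to fail.")
```

Running this inside the same container and Python environment as the server avoids discovering the missing dependency only after model construction has started.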

@alexsun07 alexsun07 deleted the amd_deepep_aiter branch July 4, 2025 03:21
Contributor Author

alexsun07 commented Jul 4, 2025

> By the way, similar to the issue and bugfix, the code here should be changed to
>
> layer.w13_weight.data = shuffle_weight(layer.w13_weight.data, (16, 16))
> torch.cuda.empty_cache()
> layer.w2_weight.data = shuffle_weight(layer.w2_weight.data, (16, 16))
> torch.cuda.empty_cache()
>
> to enable RL training

I see your point here. Thanks! Will fix soon.

@zyeric, can you share a more detailed error log, or which attributes you need that are lost?

zyeric commented Jul 4, 2025

> > By the way, similar to the issue and bugfix, the code here should be changed to
> >
> > layer.w13_weight.data = shuffle_weight(layer.w13_weight.data, (16, 16))
> > torch.cuda.empty_cache()
> > layer.w2_weight.data = shuffle_weight(layer.w2_weight.data, (16, 16))
> > torch.cuda.empty_cache()
> >
> > to enable RL training
>
> I see your point here. Thanks! Will fix soon.
>
> @zyeric, can you share a more detailed error log, or which attributes you need that are lost?

Thanks for the reply.
Making these modifications is enough to launch training successfully, but rollout is much slower than with the Triton TP4 version. I will try to debug it when I have time.

I noticed that the ROCm base image is quite old. Does AMD have any plans to upgrade it in the near future? (The aiter import problem listed above could be fixed in that image as well.)

@alexsun07
Contributor Author

> I noticed that the ROCm base image is quite old. Does AMD have any plans to upgrade it in the near future? (The aiter import problem listed above could be fixed in that image as well.)

Yes, ROCm 6.4 is on the way.

cbx6664 pushed a commit to cbx6664/sglang_cbx that referenced this pull request Jul 16, 2025
cbx6664 pushed a commit to cbx6664/sglang_cbx that referenced this pull request Jul 16, 2025
chenxijun1029 pushed a commit to chenxijun1029/sglang that referenced this pull request Jul 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Jul 17, 2025
* Use seq_len_fill_value in the cuda graph runners (sgl-project#7233)

* support custom weight loader for model runner (sgl-project#7122)

Co-authored-by: kavioyu <kavioyu@tencent.com>

* Fix AMD speculative decoding (sgl-project#7252)

* [Refactor] OAI Server components (sgl-project#7167)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* OAI Server Skeleton & Core Utility Endpoints (sgl-project#7179)

* [amd] Opt dsv3 moe (sgl-project#7160)

Co-authored-by: wunhuang <wunhuang@amd.com>

* update ci node for xeon (sgl-project#7265)

* feat: mtp support dp-attention (sgl-project#6081)

Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>

* support qwen2 running on ascend npu device (sgl-project#7022)

Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com>

* Fix Deepseek R1 0528 FP4 tensor name mismatch issue during weights loading. (sgl-project#7164)

* bugfix(tool call ebnf): Fix EBNF generation for optional function parameters (sgl-project#7283)

* Fix AWQ Dequant and Weight Loading of deepseek v2 (sgl-project#6842)

* fix: resolve b200 dsv3 mtp issue (sgl-project#7286)

* ci: Fix test_ebnf_generate_all_optional_function_params (sgl-project#7288)

* fix: only enable flash_attn test on sm80 sm90 (sgl-project#7289)

* [PD] Support get local ip from NIC for PD disaggregation (sgl-project#7237)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [PD] Add custom memory pool option to support Mooncake PD with NVLink  (sgl-project#7264)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Upstreaming hicache bug fixes (sgl-project#7267)

* Update python API of activation, topk, norm and rope and remove vllm dependency (sgl-project#6614)

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: sdp <sdp@gnr799219.jf.intel.com>

* Fix hicache benchmark script bug - some sampled input_request is [] (sgl-project#7300)

* chore: change logs from`INFO` to `DEBUG` for dp and add force quit for tokenizer manager (sgl-project#7251)

* update invalid link in doc (sgl-project#7297)

* Fix mini_lb for PD with long output: limit chunk size of decode response (sgl-project#7301)

Signed-off-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com>
Co-authored-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com>

* Fix profiler error when there are idle passes (sgl-project#7003)

* [pd] optimize dockerfile for  pd disaggregation (sgl-project#7319)

Co-authored-by: zhyncs <me@zhyncs.com>

* Merge PDLB (Prefill-Decode Load Balancer) into SGLang Router (sgl-project#7096)

* Add more refactored openai test & in CI (sgl-project#7284)

* fix: resolve blackwell deepep image issue (sgl-project#7331)

* add seed in CPU UTs to avoid flaky failure (sgl-project#7333)

* Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (sgl-project#7099)

* Reintroduce tiny fix sampler error when prob is not contiguous (sgl-project#7354)

* [Refactor] Clean up radix cache related API (sgl-project#7303)

Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* Put `_normalize_rid` before other normalization in `io_struct` (sgl-project#7363)

* [PD] Transfer hidden states for mtp when disaggregation (sgl-project#7242)

* [Bugfix][PD] Set conclude state before clear when failure happens (sgl-project#7362)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* docs: update installation (sgl-project#7366)

* [Docker] optimize dockerfile  remove deepep and blackwell merge it to… (sgl-project#7343)

Co-authored-by: Yineng Zhang <me@zhyncs.com>

* Clean unused import for mimo mtp model (sgl-project#7370)

* [Bugfix]Fix hang bug using dp attention with HiRadixCache (sgl-project#7159)

Signed-off-by: huanglong <huanglong@linux.alibaba.com>

* [Doc] add embedding rerank doc (sgl-project#7364)

* Fix judgment condition for enabling Deepseek V3/R1 shared expert fusion optimization (sgl-project#7371)

* Feat/refactor embedding server (sgl-project#7322)

* Purge VerlEngine (sgl-project#7326)

Signed-off-by: Ata Fatahi <immrata@gmail.com>

* support return logprobs for pipeline (sgl-project#7356)

Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>

* [PD] Optimize custom mem pool usage and bump mooncake version (sgl-project#7393)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Support THUDM/GLM-4-0414 (GLM-Z1) Glm4ForCausalLM architecture. (sgl-project#5485)

* Refine OpenAI serving entrypoint to remove batch requests (sgl-project#7372)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Chang Su <csu272@usc.edu>

* [Feature] Comprehensive Hybrid Parallelism Support (sgl-project#6389)

* [DeepSeekNextN] fix: residual of head norm can be None (sgl-project#7398)

* [OAI refactor] Add rerank and score serving (sgl-project#7399)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* [OAI Server Refactor] [ChatCompletions & Completions] Implement UsageInfo Processor (sgl-project#7360)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* Fix All-Gather under world size one (sgl-project#7219)

* Optimize DP attn scheduling for speculative decoding (sgl-project#7285)

* Update usage_processor.py (sgl-project#7402)

* Fix 7285 Merge Conflicts (sgl-project#7403)

* chore: upgrade mooncake-transfer-engine 0.3.4 (sgl-project#7401)

* [OAI Server Refactor] [ChatCompletions & Completions] Support Return Hidden State (sgl-project#7329)

Signed-off-by: keru <rukeyang@gmail.com>

* Remove batches api in docs & example (sgl-project#7400)

* [BugFix]: fix EmbeddingReqInput single input error (sgl-project#7396)

* [BugFix]fix qwen25 invoke function call streaming responses with curly braces as the starting indicator (sgl-project#7394)

* fix overlap pagecount (sgl-project#6984)

Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* fix: Fix CI test_function_call_parser.py (sgl-project#7425)

* Fix CPU offloading for MLA memory pool (sgl-project#7409)

* [fix] PD disaggregation when enable mtp and tp!=dp (sgl-project#7420)

* feat(oai refactor): Replace `openai_api` with `entrypoints/openai`  (sgl-project#7351)

Co-authored-by: Jin Pan <jpan236@wisc.edu>

* Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support (sgl-project#7412)

* refactor(test): reorganize OpenAI test file structure (sgl-project#7408)

* [minor] simplify the `TokenToKVPoolAllocator` (sgl-project#7414)

* Tiny add logging for GC  (sgl-project#7406)

* FlashInfer NVFP4 MoE with EP & 2-stream shared expert (sgl-project#7327)

Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: alcanderian <alcanderian@gmail.com>

* Remove copy after bmm (sgl-project#7441)

* Fix torch compile run (sgl-project#7391)

Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Sai Enduri <saimanas.enduri@amd.com>

* [misc] Add PD service discovery support in router (sgl-project#7361)

* add fused moe config for qwen3 in triton3.3.1 (sgl-project#7445)

* Fix CUDA Graph Check under Deepep with DP FFN (sgl-project#7451)

* Update hyperparameter_tuning.md (sgl-project#7454)

* feat: integrate deepgemm into EPMoE (sgl-project#6821)

Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: TianQiLin666666 <1834987979@qq.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

* Solve docker build failed in the virtual machine (sgl-project#7290)

Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Sai Enduri <saimanas.enduri@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>

* Fix a bug in BatchTokenIDOut & Misc style and dependency updates (sgl-project#7457)

* [CI] Upgrade mooncake to 0.3.4.post1 to fix 8 gpu tests (sgl-project#7472)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Fix prefill OOM due to wrong token calculation when page > 1  (sgl-project#7397)

* feat(func_call): Add more check in `BaseFormatDetector.parse_streaming_increment` (sgl-project#7479)

* Fix dtype for idle input in spec decoding (sgl-project#7456)

* update mooncake in dockerfile (sgl-project#7480)

* kvcache io kernels and test case (sgl-project#7382)

* [perf] slightly imporve DeepSeek-R1-FP4 TP8 (sgl-project#7481)

* Quick fix for DeepGemm requant to also cover MTP. (sgl-project#7378)

* Support weight loading without mmap (sgl-project#7469)

* ci: Revert openai_server related tests in AMD suites (sgl-project#7449)

* Perormance: Enable cuda graph for dp idle batch (sgl-project#7269)

Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>

* bugfix: Prevent global mutation of conv.stop_str across requests (sgl-project#7347)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* Fix RequestValidationError response format (sgl-project#7487)

* Fix MTP with Deepseek R1 Fp4 (sgl-project#7376)

* chore: bump sgl-kernel v0.2.0 (sgl-project#7490)

* chore: bump v0.4.8 (sgl-project#7493)

* [AMD] add aiter fused moe in DeepEP path (sgl-project#7268)

* enable aiter_biased_grouped_topk kernel (sgl-project#7423)

* [PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)

* Remove cumsum_buffer initilization (sgl-project#7439)

* [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm (sgl-project#7422)

* Support multi-thread model weight loading (sgl-project#7277)

* [PD] NIXL: Register kv args in advance and cleanup finished requests (sgl-project#6717)

* fix: Add `--model` as an alias for `--model-path` in server_args (sgl-project#7505)

* misc: Improvement to serving_chat.py and add more ut (sgl-project#7489)

* Fuse sorted_token_ids padding to moe_align_block_size kernel (sgl-project#7437)

* [OAI] patch origin request_id logic (sgl-project#7508)

* [PD][Spec] Fix hidden state transfer for spec decode (sgl-project#7516)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* EPLB support for MTP (sgl-project#7510)

* clean duplicate code (sgl-project#7512)

* [ci] add router benchmark script and CI (sgl-project#7498)

* fix: force synchronization between TP workers when update_weights (sgl-project#6626)

Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com>

* [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model (sgl-project#6641)

Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>

* [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug (sgl-project#7522)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* npu fused op (sgl-project#7386)

Co-authored-by: Li Junwen <lijunwen13@hisilicon.com>

* feat: send kvmetrics from sglang scheduler (sgl-project#6721)

* [PD] Add different TP sizes support for no-MLA models (sgl-project#6793)

Co-authored-by: shangmingc <csmthu@gmail.com>
Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>

* enable aiter fp8 blockscale quant (sgl-project#7520)

* take aiter get_rope back (sgl-project#7521)

* Fix typo of flash_cache (sgl-project#7513)

* feat: add return hidden_states at async generation (sgl-project#7507)

* minor: 'role' must be system/assistant/tool, but case insensitive for now (sgl-project#7499)

* Fix FP8 KV Cache Support in FA3 Backend (sgl-project#7148)

* Fix gathered_buffer issues in tbo (sgl-project#7531)

* [PD] Raise error for incompatible mooncake version and some minor fixes (sgl-project#7527)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [CMake] Fix sgl-kernel CMakeLists for Blackwell (sgl-project#7543)

* Add Tencent HunYuanMoEV1 model support (sgl-project#7549)

* Update seed in CPU UTs to avoid flaky failure with single test (sgl-project#7544)

* chore: improve ci bug reporting (sgl-project#7542)

* chore: remove vlm unnecessary import (sgl-project#7541)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* chore: bump v0.4.8.post1 (sgl-project#7559)

* [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND (sgl-project#7330)

* [Fix] incorrect assert in EPLB (sgl-project#7575)

* Updates Gemma3n MLP layer to adapt latest transformers version (sgl-project#7573)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* Fix MTP error when enabling two-batch overlap  (sgl-project#7569)

* Add e2e test for multi instance multi stage memory release/resume occupuation (sgl-project#7208)

Signed-off-by: Ata Fatahi <immrata@gmail.com>

* [CI] Add CI Testing for Prefill-Decode Disaggregation with Router (sgl-project#7540)

* Updates transformers and timm dependencies (sgl-project#7577)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* feat: support compatibility between MTP and two-batch-overlap (sgl-project#7225)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

* Move multimodal processors into a separate folder (sgl-project#7581)

* Fix broken CI TestVILAServer (sgl-project#7610)

* [router] add centralized configuration module for sgl-router (sgl-project#7588)

* Fix: Minicpm (sgl-project#7612)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* Hybrid kv cache for LLaMA4 (sgl-project#6563)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>

* [CPU] add optimizations for INT8 and FP8 DeepSeek (sgl-project#6769)

Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>

* Tiny add logs for expert location updater (sgl-project#7308)

* Fix flakiness in LoRA batch test. (sgl-project#7552)

* [BUG] fix local_rank in initialize_dp_attention (sgl-project#7584)

* Support dynamic LoRA loading / unloading in engine/server API (sgl-project#7446)

* [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated (sgl-project#7598)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* fix unit tests (sgl-project#7618)

* Let ep_scatter support arbitrary strides / ue8m0 format (sgl-project#7309)

* Let EP prefill support new DeepGEMM (sgl-project#7310)

* docs: add gb200 nvl72 and a16z grant (sgl-project#7620)

* oai: Adds support for OpenAI chat completions API in bench_serving (sgl-project#7036)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* [bugfix] Remove PR comment posting from Rust benchmark workflow (sgl-project#7625)

* [Minor] clean up multimodal processor and tokenizer manager (sgl-project#7624)

* Add dsv3 fused a gemm to sgl-kernel (sgl-project#7630)

* Add @mickqian as the CODEOWNERS of multimodal (sgl-project#7636)

* Fix stream reasoning parser and Adds Kimi reasoning parser  (sgl-project#7432)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* Fix sgl-router startup crash (sgl-project#7619)

* [bugfix] fix runtime dropping panic in editable (sgl-project#7628)

* Move files related to EPLB (sgl-project#7580)

* [misc] reduce weird rope_scaling_factor warning (sgl-project#7176)

* [AMD] Add unit-test-sgl-kernel-amd to AMD CI (sgl-project#7539)

* Update CODEOWNERS (sgl-project#7640)

* [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py (sgl-project#7643)

* [CPU] add c++ kernel to bind CPU cores and memory node (sgl-project#7524)

* Improve streaming, log_level, memory report, weight loading, and benchmark script (sgl-project#7632)

Co-authored-by: Kan Wu <wukanustc@gmail.com>

* Add dsv3 router gemm kernel (sgl-project#7627)

* chore: upgrade flashinfer v0.2.7 jit (sgl-project#7663)

* [doc] update lws doc for pd (sgl-project#7318)

* Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes (sgl-project#7648)

* Add small requirements for benchmark/parse_result tools (sgl-project#7671)

* [CPU] remove process_group from inputs of shm_allreduce and shm_allgather (sgl-project#7486)

* chore: bump sgl-kernel v0.2.1 (sgl-project#7675)

* support llama4 eagle3  (sgl-project#6985)

Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Shenggui Li <somerlee.9@gmail.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>

* Refactor mm processors and Enable mixed modality processing (sgl-project#7629)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* upgrade sgl kernel to 0.2.1 for main (sgl-project#7676)

* add description for llama4 eagle3 (sgl-project#7688)

* fix(model loader): use safe_open to prevent file handle leaks. (sgl-project#7684)

* chore: upgrade flashinfer v0.2.7.post1 (sgl-project#7698)

* Improve error handling for requests with unloaded LoRA path(s) (sgl-project#7642)

* Apply dsv3_fused_a_gemm kernel (sgl-project#7635)

* Fix GPTQMarlinMoE (sgl-project#7697)

* [1/n] apply wna16marlin kernel in moe weight only quantization (sgl-project#7683)

Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: 弋云 <yiyun.wyt@antgroup.com>
Co-authored-by: walker-ai <2398833647@qq.com>

* Apply dsv3 router gemm kernel for deepseek-r1 fp4 (sgl-project#7677)

* [AMD] Temporarily disable test_no_overlap_scheduler and test_vision_chunked_prefill (sgl-project#7717)

* [RL] add --skip-warmup (sgl-project#7416)

* [RL] support update_weights_from_distributed with different group and multiple weights (sgl-project#7292)

* [router] add --log-level to sgl-router (sgl-project#6512)

* [b200] support trt-llm allreduce fuse rms_norm_add kernel (sgl-project#7621)

* [CPU] Bind threads and numa node for each TP rank (sgl-project#6549)

Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* Support non-contiguous query input for extend/decode attention (sgl-project#7462)

* Support updating weights at once by stopping all requests (sgl-project#6698)

Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com>
Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com>

* Fix num_tokens_pre_allocated in disaggregation log (sgl-project#7714)

* [CPU] [sgl-kernel] set dispatch key of initialize to CatchAll (sgl-project#7734)

* [CPU] fix all_reduce and all_gather (sgl-project#6770)

Co-authored-by: blzheng <beilei.zheng@intel.com>

* fix awq and dsv3 fused gemm compatible (sgl-project#7735)

* [CI][Router] Fix bench_one_batch_server for pd router test (sgl-project#7731)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (sgl-project#7278)

Co-authored-by: HydraQYH <QYH820@Outlook.com>
Co-authored-by: TianQiLin666666 <1834987979@qq.com>

* fix dsv3 fused proj check  (sgl-project#7738)

* Ascend attention backend(PA&MLA) (sgl-project#7722)

Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: VDV1985 <vladdv85@mail.ru>

* [fix] fix dsv3_router_gemm filter (sgl-project#7750)

* [CPU] refine CPU integration code (sgl-project#7647)

* [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (sgl-project#6771)

* support qwen3 dense model dp attention (sgl-project#7681)

* [optimize] add two stream norm for qwen3 (sgl-project#7740)

Co-authored-by: ispobock <ispobaoke@gmail.com>

* feat: use D2D instead of H2H in pp (sgl-project#7673)

Co-authored-by: alpha-baby <fujianhao1997@qq.com>

* [Bug] add flashinfer bool check for fusedmoe in Qwen moe models (sgl-project#7723)

* [fix] put cpu in the first priority in get_device() (sgl-project#7752)

* [optimize] fuse renormalize into moe_topk_softmax (sgl-project#7744)

Co-authored-by: ispobock <ispobaoke@gmail.com>

* chore: bump sgl-kernel 0.2.2 (sgl-project#7755)

* fix CI: update native api ipynb (sgl-project#7754)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* fuse renormal into moe topk softmax kernel python code (sgl-project#7751)

Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: zhyncs <me@zhyncs.com>

* Remove type conversion and fix id map in topk (sgl-project#7759)

* Add V2-lite model test (sgl-project#7390)

Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>

* refactor llama4 dp attention logic (sgl-project#7729)

* fix(docs): fix the broken link in `docs/references/production_metrics.md` (sgl-project#7741)

Signed-off-by: rudeigerc <rudeigerc@gmail.com>

* [fix] update bench_speculative.py for compatibility (sgl-project#7764)

Signed-off-by: Kay Yan <kay.yan@daocloud.io>

* Move mem_fraction_static adjustment for multimodal models to `server_args.py` & Fix session control & Other cleanups (sgl-project#7748)

* [RL] Add --nccl-port to prevent port conflict (sgl-project#7418)

* [RL] add pause and continue generation for async rl training (sgl-project#7419)

* [Fix] Alloc return type error (sgl-project#7778)

Signed-off-by: Capronir <839972205@qq.com>

* [feat] Support EAGLE3 for Qwen (sgl-project#7745)

Co-authored-by: 纬杭 <ximing.wxm@antgroup.com>
Co-authored-by: zyksir <zyksir@outlook.com>

* saving hidden_states.clone() (sgl-project#7705)

* [1/n]: add cutlass W4A8 moe kernel for hopper architecture (sgl-project#7772)

Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com>
Co-authored-by: yicwang <yichen.wang@bytedance.com>

* add model: qwen2-audio (sgl-project#7596)

* Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario (sgl-project#7782)

* Embedding parallel by attn_tp (sgl-project#7623)

* fix: fix apply_shuffle_mul_sum (sgl-project#7444)

* chore: bump sgl-kernel v0.2.3 (sgl-project#7784)

* fix: use nvidia-nccl-cu12 2.27.5 (sgl-project#7787)

* DP Attention with Auto DeepEP Dispatch (sgl-project#7222)

* chore: upgrade sgl-kernel v0.2.3 (sgl-project#7786)

* Fix incorrect spec_num_draft_tokens in draft_extend (sgl-project#7757)

* [fix] fix misusing of is_cuda (sgl-project#7790)

* Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (sgl-project#7756)

Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com>

* chore: bump sgl-kernel v0.2.4 (sgl-project#7800)

* ci: fix port args (sgl-project#7792)

* Fix CI test OOM issue. (sgl-project#7799)

* chore: upgrade sgl-kernel v0.2.4 (sgl-project#7801)

* chore: bump v0.4.9 (sgl-project#7802)

* fix merge conflict issue

* fix hpu attention NoneType issue

* fix alignment

* fix alignment2

* Ci failure fixes

* fix attention-backend choices

---------

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Signed-off-by: Ata Fatahi <immrata@gmail.com>
Signed-off-by: keru <rukeyang@gmail.com>
Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com>
Signed-off-by: rudeigerc <rudeigerc@gmail.com>
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Signed-off-by: Capronir <839972205@qq.com>
Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: KavioYu <67678385+yukavio@users.noreply.github.com>
Co-authored-by: kavioyu <kavioyu@tencent.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com>
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>
Co-authored-by: u4lr451 <u4lr451@gmail.com>
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
Co-authored-by: Yijie Zhu <762412795@qq.com>
Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com>
Co-authored-by: Charles Chen <pychen96@gmail.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: YanbingJiang <yanbing.jiang@intel.com>
Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: sdp <sdp@gnr799219.jf.intel.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: linzhuo <15313137931lz@gmail.com>
Co-authored-by: ch-tiger1 <tiger@ch-tech.ip-ddns.com>
Co-authored-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Co-authored-by: Atream <80757050+Atream@users.noreply.github.com>
Co-authored-by: Li Hui <lambert80.ios@gmail.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com>
Co-authored-by: Ata Fatahi <immrata@gmail.com>
Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com>
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Co-authored-by: Wenbo Yang <solrex@users.noreply.github.com>
Co-authored-by: Chang Su <csu272@usc.edu>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Keyang Ru <rukeyang@gmail.com>
Co-authored-by: ehuaa <ehuamail@163.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Jin Pan <jpan236@wisc.edu>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: alcanderian <alcanderian@gmail.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Sai Enduri <saimanas.enduri@amd.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: xutizhou <xutingz@nvidia.com>
Co-authored-by: TianQiLin666666 <1834987979@qq.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Yuhong Guo <guoyuhong1985@outlook.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: Alex Sun <alex.s@amd.com>
Co-authored-by: valarLip <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: xianzhiT <xianzhitang@tencent.com>
Co-authored-by: yilian49 <43861414+yilian49@users.noreply.github.com>
Co-authored-by: DangKai <dangkai4u@outlook.com>
Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Co-authored-by: ll819214 <18801269230@163.com>
Co-authored-by: Li Junwen <lijunwen13@hisilicon.com>
Co-authored-by: zixuanzhang226 <zixuanzhang@bytedance.com>
Co-authored-by: Hongbo Xu <1320612015@qq.com>
Co-authored-by: shangmingc <csmthu@gmail.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Meng, Peng <pengmeng@tencent.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
Co-authored-by: Sheng Qi <shengqi2018@pku.edu.cn>
Co-authored-by: finetune <82650881+finetunej@users.noreply.github.com>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: Kan Wu <wukanustc@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: narutolhy <582909902@qq.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Shenggui Li <somerlee.9@gmail.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: Simon_CQK <cqk0100@gmail.com>
Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: 弋云 <yiyun.wyt@antgroup.com>
Co-authored-by: walker-ai <2398833647@qq.com>
Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com>
Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
Co-authored-by: Albert <albert.zty@antgroup.com>
Co-authored-by: Ziming Huang <1520787127@qq.com>
Co-authored-by: ayrnb <70835312+ayrnb@users.noreply.github.com>
Co-authored-by: HydraQYH <QYH820@Outlook.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: VDV1985 <vladdv85@mail.ru>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: TianyuZhang1214 <tianyuzhang1214@163.com>
Co-authored-by: alpha-baby <fujianhao1997@qq.com>
Co-authored-by: Yuchen Cheng <rudeigerc@gmail.com>
Co-authored-by: Kay Yan <kay.yan@daocloud.io>
Co-authored-by: Caproni <40862361+Capronir@users.noreply.github.com>
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com>
Co-authored-by: 纬杭 <ximing.wxm@antgroup.com>
Co-authored-by: zyksir <zyksir@outlook.com>
Co-authored-by: SijiaYang <yangsijia.614@bytedance.com>
Co-authored-by: yicwang <yichen.wang@bytedance.com>
Co-authored-by: Leng Yue <lengyue@lengyue.me>
Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com>
Co-authored-by: Gang Chen <13298548+MoonBall@users.noreply.github.com>
Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com>
Co-authored-by: jay <jthakur@habana.ai>
shuaills pushed a commit to shuaills/sglang that referenced this pull request Jul 21, 2025
@whitememory

Hi @zyeric , thanks for reporting these issues

  1. DeepEP needs to be installed for this path. We haven't released an official Docker image with DeepEP yet, but we may soon. Please stay tuned.
  2. This PR is an integration. Could you please open an issue for the problem you ran into with normal TP and RL training?

Hi @alexsun07, thank you for this PR.
Are there any tips for installing DeepEP on AMD GPUs? (I am using MI300 with the official sglang Docker image v0.4.8.post1-rocm630.)

I am stuck installing DeepEP. (I cloned the DeepEP repository and ran `DISABLE_SM90_FEATURES=1 python setup.py install`.)
Example error:
/sgl-workspace/DeepEP/csrc/kernels/layout.hip:52:9: error: invalid instruction, did you mean: s_trap?
   52 |         EP_DEVICE_ASSERT(num_ranks % NUM_MAX_NVL_PEERS == 0 and num_ranks > NUM_MAX_NVL_PEERS);
      |         ^
/sgl-workspace/DeepEP/csrc/kernels/exception_hip.cuh:49:13: note: expanded from macro 'EP_DEVICE_ASSERT'
   49 |             asm("trap;"); \
      |             ^
<inline asm>:1:2: note: instantiated into assembly here
    1 |         trap;
      |

The reason I am trying --enable-deepep-moe on ROCm is that, since 0.4.8.post1, SGLANG_USE_AITER=1 combined with --enable-ep-moe gives incorrect results or runs out of memory (OOM).
(I see that many aiter-related commits have landed in sglang, but I suspect some of them assume --enable-deepep-moe rather than --enable-ep-moe alone.)

Yuechguo pushed a commit to Yuechguo/sglang that referenced this pull request Aug 17, 2025