[PD] Add different TP sizes support for no-MLA models #6793

Hongbosherlock · 2025-05-31T16:10:15Z

Motivation

Inspired by #5922, this PR adds support for different TP sizes per DP rank for non-MLA models.
This enhancement aims to provide more flexibility in configuring PD setups.

In SGLang, the KV cache is managed in pages. When Prefill and Decode stages have differing TP sizes, the transfer of paged KV cache needs careful handling:

- MLA models or aligned TP sizes (Decode TP == Prefill TP): the existing KV cache transfer logic remains unchanged and continues to function as before.
- Decode TP < Prefill TP (e.g., Prefill TP=4, Decode TP=2): one Decode rank needs to aggregate KV cache data (different head slices) from multiple Prefill ranks.
- Decode TP > Prefill TP (e.g., Prefill TP=2, Decode TP=4): the Prefill rank needs to split its KV cache page data and send the respective sub-slices to multiple target Decode ranks.
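
The three cases above reduce to intersecting the KV-head ranges owned by each Prefill rank and each Decode rank. The following is a minimal sketch of that mapping, assuming KV heads are split evenly and contiguously across TP ranks. `head_range` and `transfer_plan` are hypothetical helpers for illustration, not functions from this PR, and the head count of 8 is an arbitrary example value.

```python
from typing import List, Tuple


def head_range(rank: int, tp_size: int, total_kv_heads: int) -> range:
    """KV heads owned by `rank` when heads are split evenly across `tp_size` ranks."""
    if total_kv_heads % tp_size != 0:
        # Assumption of this sketch: no head replication, no uneven splits.
        raise ValueError("this sketch assumes heads divide evenly across ranks")
    per_rank = total_kv_heads // tp_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def transfer_plan(
    prefill_tp: int, decode_tp: int, total_kv_heads: int
) -> List[Tuple[int, int, int, int]]:
    """Return (prefill_rank, decode_rank, first_head, num_heads) for every overlap."""
    plan = []
    for p in range(prefill_tp):
        p_heads = head_range(p, prefill_tp, total_kv_heads)
        for d in range(decode_tp):
            d_heads = head_range(d, decode_tp, total_kv_heads)
            start = max(p_heads.start, d_heads.start)
            stop = min(p_heads.stop, d_heads.stop)
            if start < stop:  # the two ranks share at least one head
                plan.append((p, d, start, stop - start))
    return plan


# Decode TP < Prefill TP: each decode rank aggregates from two prefill ranks.
aggregate_plan = transfer_plan(prefill_tp=4, decode_tp=2, total_kv_heads=8)
# Decode TP > Prefill TP: each prefill rank splits its heads across two decode ranks.
split_plan = transfer_plan(prefill_tp=2, decode_tp=4, total_kv_heads=8)
```

With equal TP sizes the plan degenerates to one entry per rank pair covering all local heads, which corresponds to the unchanged transfer path.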

Modifications

Add send_kvcache_slice to handle the differing TP size scenarios for non-MLA models.

Latest Evaluation

prefill tp_size:4 , decode tp_size:2

Accuracy: 0.900

prefill tp_size:2 , decode tp_size:4

Accuracy: 0.910

prefill tp_size:2 , decode tp_size:2

Accuracy: 0.895

prefill-node1 tp_size:2 , prefill-node2 tp_size:4 , decode tp_size:2

Accuracy: 0.910

Tested on DeepSeek-R1-Distill-Qwen-14B using H100 GPUs.

Future Optimizations

Currently, when sending the KV cache for a block of pages, self.engine.transfer_sync is called once for each page's slice within that block, which may increase TTFT.

Future PRs could build on this initial support to address this performance cost, for example with asynchronous transfers (transfer_async returning a future/event).

CC:@ShangmingCai @zhyncs @whybeyoung

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

gemini-code-assist

Hello @Hongbosherlock, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary of this pull request to help everyone quickly understand the changes and context.

This PR, authored by Hongbosherlock, introduces support for running SGLang's Prefill-Decode (PD) disaggregation setup with different Tensor Parallel (TP) sizes between the prefill and decode stages, specifically for non-MLA (Multi-head Latent Attention) models. Previously, this configuration was only supported for MLA models or when TP sizes were aligned. The core challenge addressed is the correct transfer of paged KV cache data between prefill and decode ranks when their TP configurations differ, requiring careful handling of KV head slices.

The main modification involves adding a new mechanism (send_kvcache_slice) to manage this KV cache transfer for non-MLA models with mismatched TP sizes. This new logic calculates the correct offsets and lengths for transferring the relevant slice of KV cache data per page, based on the TP ranks and world sizes of both the prefill and decode stages. The existing transfer logic is retained for MLA models or when TP sizes match.

The author notes that the current implementation uses synchronous transfers per page slice, which could potentially impact Time To First Token (TTFT), and suggests asynchronous transfers as a future optimization.

Highlights

Support for Different TP Sizes: Adds support for running Prefill-Decode (PD) with different Tensor Parallel (TP) sizes between the prefill and decode stages for non-MLA models.
KV Cache Transfer Logic: Introduces a new function (send_kvcache_slice) to handle the complex transfer of paged KV cache data when prefill and decode TP sizes are not equal, correctly slicing and transferring data based on TP configuration.
Removed Assertions: Removes previous assertions that prevented running non-MLA models with differing prefill and decode TP sizes.
TP Info Registration: Adds the TP rank and world size information to the KV arguments registered by the decode rank, allowing the prefill rank to understand the target TP configuration.

Changelog

python/sglang/srt/disaggregation/mooncake/conn.py
- Added tp_rank_id_in_group and tp_world_size_of_group fields to KVArgsRegisterInfo to store the target decode rank's TP information (lines 111-112, 123-124).
- Implemented the send_kvcache_slice method to handle KV cache transfers when prefill and decode TP sizes differ. This involves calculating head offsets and slice lengths based on TP configurations and using transfer_sync for each page slice (lines 293-413).
- Modified transfer_worker to conditionally call send_kvcache (for MLA or aligned TP) or send_kvcache_slice (for non-MLA with different TP) (lines 492-509).
- Removed assertions that previously restricted different TP sizes to only MLA backends (lines 947-949, 955-957).
- Adjusted the logic for determining is_dummy bootstrap info; for non-MLA, all prefill ranks are considered 'real' (lines 987-995).
- Added sending the TP rank ID and world size during the _register_kv_args process (lines 1070-1073, 1085-1086).
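
To make the "head offsets and slice lengths" in the changelog concrete, here is a minimal sketch of the per-page byte arithmetic. It assumes a page stores `page_size` token slots back to back and each slot stores that rank's local KV heads contiguously; the real memory layout in SGLang may differ. `page_slices` and all of its parameters are hypothetical names for illustration, not the signature of `send_kvcache_slice`.

```python
from typing import List, Tuple


def page_slices(
    page_size: int,
    src_heads: int,        # KV heads held per token on the sending (prefill) rank
    dst_heads: int,        # KV heads held per token on the receiving (decode) rank
    head_bytes: int,       # bytes per head per token (head_dim * dtype size)
    src_head_start: int,   # first local head to read on the sender
    dst_head_start: int,   # first local head to write on the receiver
    num_heads: int,        # number of heads moved per token
) -> List[Tuple[int, int, int]]:
    """(src_offset, dst_offset, length) in bytes for each token slot of one page."""
    src_token_bytes = src_heads * head_bytes
    dst_token_bytes = dst_heads * head_bytes
    length = num_heads * head_bytes
    return [
        (
            t * src_token_bytes + src_head_start * head_bytes,
            t * dst_token_bytes + dst_head_start * head_bytes,
            length,
        )
        for t in range(page_size)
    ]


# Example: Prefill TP=4, Decode TP=2, 8 KV heads in total (arbitrary value).
# Prefill rank 1 holds global heads 2-3, which land at local heads 2-3 of decode rank 0.
slices = page_slices(
    page_size=2, src_heads=2, dst_heads=4, head_bytes=256,
    src_head_start=0, dst_head_start=2, num_heads=2,
)
```

Under this layout one page yields `page_size` separate pieces, because the destination bytes for consecutive tokens are no longer adjacent once the per-token sizes differ.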

TP sizes now can roam,
Across the nodes, far from home.
KV cache must flow,
Where the decode ranks go,
A sliced-up, distributed loam.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces an important feature: support for different Tensor Parallel (TP) sizes between prefill and decode stages for non-MLA models. This significantly enhances flexibility in configuring disaggregated setups.

The core logic for slicing and dicing the KV cache in the new send_kvcache_slice method appears to correctly handle the two main scenarios (Prefill TP > Decode TP, and Prefill TP <= Decode TP). The calculations for head distribution, offsets, and slice lengths seem robust.

The integration into the existing transfer_worker and the updates to KVArgsRegisterInfo and _register_kv_args are also well-handled.

I have a few suggestions, primarily around improving the clarity and maintainability of the new complex send_kvcache_slice function through more detailed documentation and potentially clearer variable names. Additionally, please ensure that comprehensive unit tests are added to cover the new logic, especially the different TP configurations and edge cases, as indicated in the PR checklist.

Great work on tackling this complex feature!

Summary of Findings

Code Documentation: The new send_kvcache_slice method is complex and would benefit significantly from a detailed docstring explaining its parameters and logic for different TP configurations.
Variable Naming Clarity: Some variable names within send_kvcache_slice (e.g., heads_per_decode_rank, item_len_of_decode_rank_page) could be more descriptive to enhance code readability and maintainability.
Stylistic Issues (Not Commented due to Settings): Minor stylistic issues were found, such as an extra space in a list comprehension (line 320) and example values in comments (lines 364, 365) that should be removed or generalized. These were not added as specific review comments due to the configured severity threshold.

Merge Readiness

The pull request introduces a valuable and complex feature. The core logic appears sound. However, before merging, I recommend addressing the suggestions for improving code clarity and maintainability, particularly by adding a detailed docstring to send_kvcache_slice and considering more explicit variable names.

Crucially, please ensure that comprehensive unit tests are added to validate this new functionality across various TP configurations (Prefill TP > Decode TP, Prefill TP < Decode TP, Prefill TP == Decode TP) and cover potential edge cases. The PR checklist items, especially regarding testing and documentation, should be completed.

Given the medium severity comments and the need for tests, I am requesting changes. I am not authorized to approve pull requests.

python/sglang/srt/disaggregation/mooncake/conn.py

Hongbosherlock · 2025-05-31T16:15:15Z

Examples:
Without DP attention:

prefill tp size > decode tp size

python -m sglang.launch_server --model-path DeepSeek-R1-Distill-Qwen-14B --port 30000 --host 127.0.0.1  --tp-size 4 --trust-remote-code --disaggregation-mode prefill --base-gpu-id 0 

python -m sglang.launch_server --model-path DeepSeek-R1-Distill-Qwen-14B --port 30100 --host 127.0.0.1  --tp-size 2 --trust-remote-code --disaggregation-mode decode --base-gpu-id 4

prefill tp size < decode tp size

python -m sglang.launch_server --model-path DeepSeek-R1-Distill-Qwen-14B --port 30000 --host 127.0.0.1 --tp-size 2 --trust-remote-code --disaggregation-mode prefill --base-gpu-id 0

python -m sglang.launch_server --model-path DeepSeek-R1-Distill-Qwen-14B --port 30100 --host 127.0.0.1  --tp-size 4 --trust-remote-code --disaggregation-mode decode --base-gpu-id 4

mini_lb command:

python3 -m sglang.srt.disaggregation.mini_lb --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30100 --host 127.0.0.1 --port 8000

jokerwyt · 2025-06-01T03:19:49Z

Is it possible we support such a function in the base backend? They are likely to be shared by all backends.

ShangmingCai · 2025-06-01T03:28:48Z

Thank you for this PR! Will review it tomorrow and test to see if the bytes slice solution can pass the E2E tests.

Hongbosherlock · 2025-06-01T12:27:35Z

> Is it possible we support such a function in the base backend? They are likely to be shared by all backends.

Do you mean supporting this function for both the Mooncake and NIXL backends?
I will look into it.

Hongbosherlock · 2025-06-01T15:20:20Z

Evaluation

python3 benchmark/gsm8k/bench_sglang.py --port 8000 --parallel 10 --num-questions 200

prefill tp_size:4 , decode tp_size:2

Accuracy: 0.755

Looking into accuracy issue when Decode TP < Prefill TP.

prefill tp_size:2 , decode tp_size:4

Accuracy: 0.910

python/sglang/srt/disaggregation/mooncake/conn.py

ShangmingCai

These newly added args are not straightforward in my opinion:

    tp_rank_id_in_group: int
    tp_world_size_of_group: int

How about we change them to

    dst_kv_item_len: int
    dst_tp_slice_index: int

and use them to calculate the offset, where dst_tp_slice_index = self.kv_mgr.kv_args.engine_rank % self.required_dst_info_num in the KVReceiver, so that we retain all rank calculation logic on the decode side?
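
One possible reading of this suggestion, for the Decode TP > Prefill TP case only, is sketched below. It assumes `required_dst_info_num` equals `decode_tp // prefill_tp` (the number of decode ranks served by one prefill rank); the thread does not confirm that definition. `dst_slice_info` is a hypothetical helper, not code from this PR.

```python
from typing import Tuple


def dst_slice_info(
    decode_rank: int, decode_tp: int, prefill_tp: int, src_kv_item_len: int
) -> Tuple[int, int, int]:
    """Return (dst_tp_slice_index, dst_kv_item_len, src_offset) for one decode rank."""
    if decode_tp < prefill_tp or decode_tp % prefill_tp != 0:
        # The aggregation case (Decode TP < Prefill TP) is outside this sketch.
        raise ValueError("sketch covers only decode_tp as a multiple of prefill_tp")
    # Assumption: number of decode ranks that share one prefill rank's KV item.
    required_dst_info_num = decode_tp // prefill_tp
    dst_tp_slice_index = decode_rank % required_dst_info_num
    dst_kv_item_len = src_kv_item_len // required_dst_info_num
    # Byte offset of this decode rank's sub-slice inside the prefill rank's item.
    src_offset = dst_tp_slice_index * dst_kv_item_len
    return dst_tp_slice_index, dst_kv_item_len, src_offset


# Prefill TP=2, Decode TP=4: decode ranks 2 and 3 share prefill rank 1's item.
info_rank2 = dst_slice_info(decode_rank=2, decode_tp=4, prefill_tp=2, src_kv_item_len=1024)
info_rank3 = dst_slice_info(decode_rank=3, decode_tp=4, prefill_tp=2, src_kv_item_len=1024)
```

The two values are enough to place a sub-slice when one prefill rank fans out to several decode ranks, but they do not by themselves describe the case where one decode rank collects from several prefill ranks.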

ShangmingCai

Since the data_addrs are no longer contiguous, because we have to split every item either at the src or the dst, I expect a significant performance drop unless we are using a very large page size.

Overall, this is a great PR that verifies that we can support different TP sizes with non-MLA models by splitting KV cache bytes. Will do some experiments this week.

Hongbosherlock · 2025-06-02T16:35:18Z

> These newly added args are not straightforward in my opinion,
>     tp_rank_id_in_group: int
>     tp_world_size_of_group: int
> How about we change it into the
>     dst_kv_item_len: int
>     dst_tp_slice_index: int
> , and we use them to calculate the offset, where dst_tp_slice_index = self.kv_mgr.kv_args.engine_rank % self.required_dst_info_num in the KVReceiver, so that we retain all rank calculation logic in the decode side.

Thanks for your time and feedback! I’ll work on the changes shortly.

Hongbosherlock · 2025-06-02T16:42:40Z

> prefill tp_size:4 , decode tp_size:2
> Accuracy: 0.755
> Looking into accuracy issue when Decode TP < Prefill TP.

Each decode rank needs to receive KV cache from multiple prefill ranks. The accuracy problem is likely caused by a decode rank mistakenly assuming that it has received all KV cache after getting data from just one prefill rank.

Potential solution:
We might need a parameter similar to required_dst_info_num, maybe required_prefill_info_num, to indicate how many prefill ranks a decode rank must receive KV cache from before it can safely update its state.
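
The idea can be sketched as a small per-request counter on the decode side. This is a minimal illustration under the assumption that each prefill rank signals completion once per request; `KVArrivalTracker` and its method names are hypothetical, and only `required_prefill_info_num` comes from the proposal above.

```python
from typing import Dict, Set


class KVArrivalTracker:
    """Tracks which prefill ranks have finished sending KV cache for each request."""

    def __init__(self, required_prefill_info_num: int) -> None:
        self.required = required_prefill_info_num
        self.seen: Dict[str, Set[int]] = {}

    def on_transfer_done(self, request_id: str, prefill_rank: int) -> bool:
        """Record one completion; return True only when all required ranks are in."""
        ranks = self.seen.setdefault(request_id, set())
        ranks.add(prefill_rank)  # a set ignores duplicate notifications
        if len(ranks) >= self.required:
            del self.seen[request_id]  # request is complete, drop bookkeeping
            return True
        return False


# Prefill TP=4, Decode TP=2: each decode rank waits for two prefill ranks.
tracker = KVArrivalTracker(required_prefill_info_num=2)
first = tracker.on_transfer_done("req-0", prefill_rank=0)
duplicate = tracker.on_transfer_done("req-0", prefill_rank=0)
second = tracker.on_transfer_done("req-0", prefill_rank=1)
```

Only the final call reports the request as ready, so the decode rank would not update its state after hearing from a single prefill rank.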

Hongbosherlock · 2025-06-04T08:49:34Z

Fixed the accuracy issue when Decode TP < Prefill TP in 9ba6379
Updated Evaluation

prefill tp_size:4 , decode tp_size:2

Accuracy: 0.900

prefill tp_size:2 , decode tp_size:4

Accuracy: 0.910

prefill tp_size:2 , decode tp_size:2

Accuracy: 0.895

prefill-node1 tp_size:2 , prefill-node2 tp_size:4 , decode tp_size:2

Accuracy: 0.910

Tested on DeepSeek-R1-Distill-Qwen-14B using H100 GPUs.

python/sglang/srt/disaggregation/mooncake/conn.py

Hongbosherlock · 2025-06-04T14:06:45Z

> These newly added args are not straightforward in my opinion,
>     tp_rank_id_in_group: int
>     tp_world_size_of_group: int
> How about we change it into the
>     dst_kv_item_len: int
>     dst_tp_slice_index: int
> , and we use them to calculate the offset, where dst_tp_slice_index = self.kv_mgr.kv_args.engine_rank % self.required_dst_info_num in the KVReceiver, so that we retain all rank calculation logic in the decode side.

Thanks for the suggestion! I tried to refactor it, but found that other parts still require the decode TP group rank and size for correct computation (e.g., head offset calculation and condition checks).
To keep the implementation clear and maintainable, I've renamed the original arguments to:

dst_tp_rank: int
dst_tp_size: int
dst_kv_item_len: int

minor

ByronHsu

Since most of the change is on the Mooncake side, I will defer to @ShangmingCai to review/approve.

python/sglang/srt/disaggregation/prefill.py

ShangmingCai · 2025-06-12T04:42:52Z

@Hongbosherlock ~~When prefill tp > decode tp (with MLA), the current PR will hang. I think some code hasn't been included yet. The CI and my local test both hang.~~ I have fixed this for you. Let me run another round of local test and CI.

Hongbosherlock · 2025-06-12T06:28:40Z

> @Hongbosherlock ~~When prefill tp > decode tp (with MLA), the current PR will hang. I think some code hasn't been included yet. The CI and my local test both hang.~~ I have fixed this for you. Let me run another round of local test and CI.

Thanks!

Hongbosherlock · 2025-06-17T06:53:22Z

Hi @ShangmingCai, it looks like all CI checks have passed. Please let me know if there's anything else I should address. Looking forward to getting this merged!

ShangmingCai · 2025-06-18T03:45:25Z

@Hongbosherlock Hello, after discussing this PR with some maintainers, some of them are worried that the performance is too poor for a real deployment scenario. I am trying #7236, which might help improve the throughput of small KV head piece transfers. Let me try this first to see if we can improve the performance a little.

ShangmingCai · 2025-06-23T11:55:08Z

@Hongbosherlock, I have verified that the performance drop can be reduced by #7236. We are able to have it rise from 10% of the original to 70% of the original performance through batching small pieces inside mooncake, which is an acceptable degradation value. I will run more tests, and hopefully we can get these PRs merged this week.
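
The effect of batching small pieces can be illustrated with a stand-in engine that only counts submissions. This is a sketch, not Mooncake's API: `CountingEngine` and its `batch_transfer_sync` method are hypothetical, the real batching happens inside Mooncake via #7236, and the piece sizes are arbitrary example values.

```python
from typing import List, Tuple

Piece = Tuple[int, int, int]  # (src_addr, dst_addr, length_in_bytes)


class CountingEngine:
    """Stand-in for a transfer engine; it moves nothing and only counts calls."""

    def __init__(self) -> None:
        self.calls = 0
        self.bytes_moved = 0

    def transfer_sync(self, src: int, dst: int, length: int) -> int:
        self.calls += 1
        self.bytes_moved += length
        return 0

    def batch_transfer_sync(
        self, srcs: List[int], dsts: List[int], lengths: List[int]
    ) -> int:
        self.calls += 1  # one submission regardless of the number of pieces
        self.bytes_moved += sum(lengths)
        return 0


def send_per_piece(engine: CountingEngine, pieces: List[Piece]) -> None:
    """One blocking call per head slice, as in the first version of this PR."""
    for src, dst, length in pieces:
        engine.transfer_sync(src, dst, length)


def send_batched(engine: CountingEngine, pieces: List[Piece]) -> None:
    """Collect all slices and hand them to the engine in a single call."""
    if not pieces:
        return
    srcs, dsts, lengths = zip(*pieces)
    engine.batch_transfer_sync(list(srcs), list(dsts), list(lengths))


# 64 token slots, 512 bytes each, with different strides on each side.
pieces = [(i * 512, i * 1024, 512) for i in range(64)]
per_piece_engine, batched_engine = CountingEngine(), CountingEngine()
send_per_piece(per_piece_engine, pieces)
send_batched(batched_engine, pieces)
```

Both paths move the same number of bytes; the difference is 64 blocking round trips versus one submission, which is where the per-call overhead is saved.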

Hongbosherlock · 2025-06-23T12:16:58Z

> @Hongbosherlock, I have verified that the performance drop can be reduced by #7236. We are able to have it rise from 10% of the original to 70% of the original performance through batching small pieces inside mooncake, which is an acceptable degradation value. I will run more tests, and hopefully we can get these PRs merged this week.

Thank you for the update! I'm glad to hear that. Please let me know if there's anything else I can assist with. I'll also run some tests and update this PR if necessary.

Hongbosherlock · 2025-06-24T15:06:08Z

@ShangmingCai I have updated the code, and it's really a significant improvement.

for example:

prefill tp_size:4 , decode tp_size:2

# base
Accuracy: 0.905
Invalid: 0.000
Latency: 146.276 s
Output throughput: 189.136 token/s

# new batch transfer
Accuracy: 0.905
Invalid: 0.000
Latency: 33.470 s
Output throughput: 863.560 token/s
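
For reference, the ratios implied by the two runs above can be computed directly from the quoted figures. The snippet uses only the numbers in this comment; the variable names are illustrative.

```python
# Figures copied from the benchmark output above (prefill tp_size 4, decode tp_size 2).
base = {"latency_s": 146.276, "throughput_tok_s": 189.136}
batched = {"latency_s": 33.470, "throughput_tok_s": 863.560}

# How many times shorter the end-to-end benchmark latency became.
latency_ratio = base["latency_s"] / batched["latency_s"]
# How many times higher the output throughput became.
throughput_ratio = batched["throughput_tok_s"] / base["throughput_tok_s"]

print(f"latency reduced by {latency_ratio:.2f}x")
print(f"throughput increased by {throughput_ratio:.2f}x")
```

Both ratios come out between 4x and 5x, with accuracy unchanged at 0.905 in both runs.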

ShangmingCai · 2025-06-24T15:31:34Z

@Hongbosherlock Yes, same results. But the batch transfer api requires the latest version and we are seeing some slice-failed reports. I need to address this bug first to release 0.3.4.post2 to make sure PD is runnable under all cases for sglang main. I will trigger the CI after we fix the bug and upgrade the package version, and get this PR merged when all CI passes.

ishandhanani · 2025-07-10T18:06:08Z

@Hongbosherlock would it be possible to also contribute this to NIXL?

Hongbosherlock · 2025-07-13T09:49:31Z

> @Hongbosherlock would it be possible to also contribute this to NIXL?

Sure, will submit a PR as soon as possible.

Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>

Bind threads and numa node for each TP rank (sgl-project#6549) Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com> * Support non-contiguous query input for extend/decode attention (sgl-project#7462) * Support updating weights at once by stopping all requests (sgl-project#6698) Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com> Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> * Fix num_tokens_pre_allocated in disaggregation log (sgl-project#7714) * [CPU] [sgl-kernel] set dispatch key of initialize to CatchAll (sgl-project#7734) * [CPU] fix all_reduce and all_gather (sgl-project#6770) Co-authored-by: blzheng <beilei.zheng@intel.com> * fix awq and dsv3 fused gemm compatible (sgl-project#7735) * [CI][Router] Fix bench_one_batch_server for pd router test (sgl-project#7731) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> * Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (sgl-project#7278) Co-authored-by: HydraQYH <QYH820@Outlook.com> Co-authored-by: TianQiLin666666 <1834987979@qq.com> * fix dsv3 fused proj check (sgl-project#7738) * Ascend attention backend(PA&MLA) (sgl-project#7722) Co-authored-by: Maksim <makcum888e@mail.ru> Co-authored-by: VDV1985 <vladdv85@mail.ru> * [fix] fix dsv3_router_gemm filter (sgl-project#7750) * [CPU] refine CPU integration code (sgl-project#7647) * [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (sgl-project#6771) * support qwen3 dense model dp attention (sgl-project#7681) * [optimize] add two stream norm for qwen3 (sgl-project#7740) Co-authored-by: ispobock <ispobaoke@gmail.com> * feat: use D2D instead of H2H in pp (sgl-project#7673) Co-authored-by: alpha-baby <fujianhao1997@qq.com> * [Bug] add flashinfer bool check for fusedmoe in Qwen moe models (sgl-project#7723) * [fix] put cpu in the first priority in get_device() (sgl-project#7752) * [optimize] fuse renormalize into moe_topk_softmax (sgl-project#7744) Co-authored-by: ispobock 
<ispobaoke@gmail.com> * chore: bump sgl-kernel 0.2.2 (sgl-project#7755) * fix CI: update native api ipynb (sgl-project#7754) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> * fuse renormal into moe topk softmax kernel python code (sgl-project#7751) Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: zhyncs <me@zhyncs.com> * Remove type conversion and fix id map in topk (sgl-project#7759) * Add V2-lite model test (sgl-project#7390) Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com> * refactor llama4 dp attention logic (sgl-project#7729) * fix(docs): fix the broken link in `docs/references/production_metrics.md` (sgl-project#7741) Signed-off-by: rudeigerc <rudeigerc@gmail.com> * [fix] update bench_speculative.py for compatibility (sgl-project#7764) Signed-off-by: Kay Yan <kay.yan@daocloud.io> * Move mem_fraction_static adjustment for multimodal models to `server_args.py` & Fix session control & Other cleanups (sgl-project#7748) * [RL] Add --nccl-port to prevent port conflict (sgl-project#7418) * [RL] add pause and continue generation for async rl training (sgl-project#7419) * [Fix] Alloc return type error (sgl-project#7778) Signed-off-by: Capronir <839972205@qq.com> * [feat] Support EAGLE3 for Qwen (sgl-project#7745) Co-authored-by: 纬杭 <ximing.wxm@antgroup.com> Co-authored-by: zyksir <zyksir@outlook.com> * saving hidden_states.clone() (sgl-project#7705) * [1/n]: add cutlass W4A8 moe kernel for hopper architecture (sgl-project#7772) Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com> Co-authored-by: yicwang <yichen.wang@bytedance.com> * add model: qwen2-audio (sgl-project#7596) * Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario (sgl-project#7782) * Embedding parallel by attn_tp (sgl-project#7623) * fix: fix apply_shuffle_mul_sum (sgl-project#7444) * chore: bump sgl-kernel v0.2.3 (sgl-project#7784) * fix: use nvidia-nccl-cu12 2.27.5 (sgl-project#7787) * DP Attention with Auto DeepEP Dispatch 
(sgl-project#7222) * chore: upgrade sgl-kernel v0.2.3 (sgl-project#7786) * Fix incorrect spec_num_draft_tokens in draft_extend (sgl-project#7757) * [fix] fix misusing of is_cuda (sgl-project#7790) * Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (sgl-project#7756) Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com> * chore: bump sgl-kernel v0.2.4 (sgl-project#7800) * ci: fix port args (sgl-project#7792) * Fix CI test OOM issue. (sgl-project#7799) * chore: upgrade sgl-kernel v0.2.4 (sgl-project#7801) * chore: bump v0.4.9 (sgl-project#7802) * fix merge conflict issue * fix hpu attention nonetyep issue * fix alignment * fix alignment2 * Ci failure fixes * fix attention-backend choices --------- Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com> Signed-off-by: huanglong <huanglong@linux.alibaba.com> Signed-off-by: Ata Fatahi <immrata@gmail.com> Signed-off-by: keru <rukeyang@gmail.com> Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com> Signed-off-by: rudeigerc <rudeigerc@gmail.com> Signed-off-by: Kay Yan <kay.yan@daocloud.io> Signed-off-by: Capronir <839972205@qq.com> Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com> Signed-off-by: Mohit Sinha <msinha@habana.ai> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: KavioYu <67678385+yukavio@users.noreply.github.com> Co-authored-by: kavioyu <kavioyu@tencent.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com> Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com> Co-authored-by: wunhuang <wunhuang@amd.com> Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com> Co-authored-by: u4lr451 <u4lr451@gmail.com> Co-authored-by: austindeng <austindeng@tencent.com> Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com> 
Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: ch-wan <cwan39@gatech.edu> Co-authored-by: Yijie Zhu <762412795@qq.com> Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com> Co-authored-by: Charles Chen <pychen96@gmail.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: YanbingJiang <yanbing.jiang@intel.com> Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com> Co-authored-by: jianan-gu <jianan.gu@intel.com> Co-authored-by: sdp <sdp@gnr799219.jf.intel.com> Co-authored-by: Binyao Jiang <byjiang1996@gmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: linzhuo <15313137931lz@gmail.com> Co-authored-by: ch-tiger1 <tiger@ch-tech.ip-ddns.com> Co-authored-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: Simo Lin <linsimo.mark@gmail.com> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com> Co-authored-by: Atream <80757050+Atream@users.noreply.github.com> Co-authored-by: Li Hui <lambert80.ios@gmail.com> Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com> Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com> Co-authored-by: Ata Fatahi <immrata@gmail.com> Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com> Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> Co-authored-by: Wenbo Yang <solrex@users.noreply.github.com> Co-authored-by: Chang Su <csu272@usc.edu> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Keyang Ru 
<rukeyang@gmail.com> Co-authored-by: ehuaa <ehuamail@163.com> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Lifu Huang <lifu.hlf@gmail.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: alcanderian <alcanderian@gmail.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Sai Enduri <saimanas.enduri@amd.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: xutizhou <xutingz@nvidia.com> Co-authored-by: TianQiLin666666 <1834987979@qq.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Yuhong Guo <guoyuhong1985@outlook.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: Alex Sun <alex.s@amd.com> Co-authored-by: valarLip <103567126+valarLip@users.noreply.github.com> Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: xianzhiT <xianzhitang@tencent.com> Co-authored-by: yilian49 <43861414+yilian49@users.noreply.github.com> Co-authored-by: DangKai <dangkai4u@outlook.com> Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: ll819214 <18801269230@163.com> Co-authored-by: Li Junwen <lijunwen13@hisilicon.com> Co-authored-by: zixuanzhang226 <zixuanzhang@bytedance.com> Co-authored-by: Hongbo Xu <1320612015@qq.com> Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> Co-authored-by: Meng, Peng <pengmeng@tencent.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: tarinkk 
<129432511+tarinkk@users.noreply.github.com> Co-authored-by: tarinkk <rt572@physics.rutger.edu> Co-authored-by: tarinkk <rt572@rutgers.physics.edu> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com> Co-authored-by: Sheng Qi <shengqi2018@pku.edu.cn> Co-authored-by: finetune <82650881+finetunej@users.noreply.github.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: Kan Wu <wukanustc@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: narutolhy <582909902@qq.com> Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Shenggui Li <somerlee.9@gmail.com> Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: Simon_CQK <cqk0100@gmail.com> Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: yych0745 <1398089567@qq.com> Co-authored-by: HandH1998 <1335248067@qq.com> Co-authored-by: 弋云 <yiyun.wyt@antgroup.com> Co-authored-by: walker-ai <2398833647@qq.com> Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com> Co-authored-by: Albert <albert.zty@antgroup.com> Co-authored-by: Ziming Huang <1520787127@qq.com> Co-authored-by: ayrnb <70835312+ayrnb@users.noreply.github.com> Co-authored-by: HydraQYH <QYH820@Outlook.com> Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: Maksim <makcum888e@mail.ru> Co-authored-by: VDV1985 <vladdv85@mail.ru> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: TianyuZhang1214 <tianyuzhang1214@163.com> Co-authored-by: alpha-baby <fujianhao1997@qq.com> Co-authored-by: Yuchen Cheng <rudeigerc@gmail.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: Caproni <40862361+Capronir@users.noreply.github.com> 
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com> Co-authored-by: 纬杭 <ximing.wxm@antgroup.com> Co-authored-by: zyksir <zyksir@outlook.com> Co-authored-by: SijiaYang <yangsijia.614@bytedance.com> Co-authored-by: yicwang <yichen.wang@bytedance.com> Co-authored-by: Leng Yue <lengyue@lengyue.me> Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com> Co-authored-by: Gang Chen <13298548+MoonBall@users.noreply.github.com> Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com> Co-authored-by: jay <jthakur@habana.ai>

Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>

Hongbosherlock and others added 6 commits May 29, 2025 00:49

support decode_tp > prefill_tp

6ed77eb

Merge branch 'sgl-project:main' into support_tp

9294eba

support prefill_tp > decode_tp

9889917

support different tp for no-MLA

3b38371

Merge branch 'sgl-project:main' into support_tp

72c5e08

clean code

2ab0a14

Hongbosherlock requested review from hnyls2002 and ByronHsu as code owners May 31, 2025 16:10

gemini-code-assist bot reviewed May 31, 2025

gemini-code-assist bot suggested changes May 31, 2025

python/sglang/srt/disaggregation/mooncake/conn.py (outdated, resolved)

python/sglang/srt/disaggregation/mooncake/conn.py (outdated, resolved)

ShangmingCai reviewed Jun 2, 2025

python/sglang/srt/disaggregation/mooncake/conn.py (outdated, resolved)

ShangmingCai reviewed Jun 2, 2025

python/sglang/srt/disaggregation/mooncake/conn.py (outdated, resolved)

ShangmingCai reviewed Jun 2, 2025

python/sglang/srt/disaggregation/mooncake/conn.py (resolved)

ShangmingCai reviewed Jun 2, 2025

Hongbosherlock and others added 2 commits June 2, 2025 23:32

optimize:remove group_concurrent_contiguous

3c8581f

Merge branch 'sgl-project:main' into support_tp

c80399f

fix Prefill TP > Decode TP

9ba6379

ShangmingCai reviewed Jun 4, 2025

python/sglang/srt/disaggregation/mooncake/conn.py (outdated, resolved)

ShangmingCai reviewed Jun 4, 2025

python/sglang/srt/disaggregation/mooncake/conn.py (outdated, resolved)

Improve code robustness

01c3387

ShangmingCai added 2 commits June 11, 2025 19:13

Update conn.py

2293eab

minor

Merge branch 'main' into support_tp

886bd68

ByronHsu reviewed Jun 12, 2025

python/sglang/srt/disaggregation/prefill.py (resolved)

Hongbosherlock added 3 commits June 12, 2025 12:05

update kvargs

efbb101

Merge branch 'main' into support_tp

4567b28

updates

a55245a

Hongbosherlock force-pushed the support_tp branch from d0357d0 to a55245a Compare June 12, 2025 04:19

resolve conflict

000ccdb

Hongbosherlock force-pushed the support_tp branch from 869eae8 to 000ccdb Compare June 16, 2025 03:28

Hongbosherlock and others added 5 commits June 23, 2025 20:18

Merge branch 'sgl-project:main' into support_tp

bc2690b

Merge branch 'main' into support_tp

ba65ef8

Fix sync status

6eb9634

Merge branch 'main' into support_tp

484a717

add batch transfer

6eda16d

Merge branch 'main' into support_tp

f5bc385

zhyncs merged commit e21aa1d into sgl-project:main Jun 25, 2025

ShangmingCai mentioned this pull request Jun 27, 2025

[Roadmap] Prefill and Decoding Disaggregation #4655 (open, 13 tasks)

chenxijun1029 pushed a commit to chenxijun1029/sglang that referenced this pull request Jul 17, 2025

[PD] Add different TP sizes support for no-MLA models (sgl-project#6793)

72a13bb

Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>

shuaills pushed a commit to shuaills/sglang that referenced this pull request Jul 21, 2025

[PD] Add different TP sizes support for no-MLA models (sgl-project#6793)

e5afc34

Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>

Hongbosherlock commented May 31, 2025 (edited)

gemini-code-assist bot left a comment

Summary of Changes

Highlights

Changelog

Footnotes

gemini-code-assist bot left a comment

Code Review

Summary of Findings

Merge Readiness

Hongbosherlock commented May 31, 2025

jokerwyt commented Jun 1, 2025

ShangmingCai commented Jun 1, 2025

Hongbosherlock commented Jun 1, 2025

Hongbosherlock commented Jun 1, 2025

Evaluation

ShangmingCai left a comment

ShangmingCai left a comment

Hongbosherlock commented Jun 2, 2025

Hongbosherlock commented Jun 2, 2025

Hongbosherlock commented Jun 4, 2025 (edited)

Hongbosherlock commented Jun 4, 2025 (edited)

ByronHsu left a comment

ShangmingCai commented Jun 12, 2025 (edited)

Hongbosherlock commented Jun 12, 2025

Hongbosherlock commented Jun 17, 2025

ShangmingCai commented Jun 18, 2025

ShangmingCai commented Jun 23, 2025 (edited)

Hongbosherlock commented Jun 23, 2025

Hongbosherlock commented Jun 24, 2025

ShangmingCai commented Jun 24, 2025
