Conversation

@ch-wan (Collaborator) commented May 18, 2025

Motivation

The current parallelism flags in SGLang are not fully compatible with one another. For example, setting --moe-dense-tp-size to 1 is incompatible with --enable-ep-moe, a combination required by many users (e.g., #6297, #7041, #7055). This PR makes the following changes so that the existing parallelism flags (i.e., --moe-dense-tp-size set to 1 or None, --enable-dp-attention with dp=tp or dp<tp, --enable-ep-moe, --enable-deepep-moe, --enable-dp-lm-head, and speculative decoding) are compatible:

  • refactored the existing parallelism control flow by introducing require_mlp_tp_gather and require_attn_tp_gather (a sketch of these predicates appears at the end of this description).
  • fixed an issue where gathered_buffer was not created correctly for EP/TP MoE when DP FFN and DP LM Head are in use:
    if moe_dense_tp_size == 1 and global_server_args_dict["enable_dp_lm_head"]:
        local_batch.global_num_tokens = [num_tokens]
        local_batch.global_num_tokens_for_logprob = [num_tokens_for_logprob]
  • implemented a TODO that reduces the size of gathered_buffer under pure TP:
    # TODO(ch-wan): SP layernorm should use a different logic to manage gathered_buffer
  • fixed incorrect control logic in the communicator: local_attn_dp_size must be replaced with attn_dp_size when deciding whether to perform dp_gather.
    # Before the fix: local_attn_dp_size wrongly gates the DP gather below.
    if context.local_attn_dp_size != 1:
        if context.attn_tp_rank == 0:
            hidden_states += residual
        # Swap in the shared gathered_buffer, gather the per-rank hidden
        # states across DP ranks, then scatter the residual back.
        hidden_states, local_hidden_states = (
            forward_batch.gathered_buffer,
            hidden_states,
        )
        dp_gather_partial(hidden_states, local_hidden_states, forward_batch)
        dp_scatter(residual, hidden_states, forward_batch)
        if hidden_states.shape[0] != 0:
            hidden_states = layernorm(hidden_states)
  • added 60 test cases for comprehensive coverage.

Some fixes are split out into separate PRs: #6378, #7219, #7398. These PRs should be merged together to fully resolve the compatibility issue.
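
Below is a minimal sketch of how these predicates can fit together. The function names match the merged utilities (see the generated summary later in this thread), but the condition bodies are simplified assumptions for illustration, not the actual implementation in python/sglang/srt/utils.py:

    from types import SimpleNamespace

    def require_mlp_tp_gather(server_args) -> bool:
        # Assumed condition: DP attention with a tensor-parallel dense MLP
        # needs a gather across the full TP group before the MLP.
        return server_args.enable_dp_attention and server_args.moe_dense_tp_size != 1

    def require_attn_tp_gather(server_args) -> bool:
        # Assumed condition: with a data-parallel dense MLP
        # (moe_dense_tp_size == 1), only the attention-TP group gathers.
        return server_args.enable_dp_attention and server_args.moe_dense_tp_size == 1

    def require_gathered_buffer(server_args) -> bool:
        # gathered_buffer is needed whenever either gather is performed.
        return require_mlp_tp_gather(server_args) or require_attn_tp_gather(server_args)

    def require_mlp_sync(server_args) -> bool:
        # Ranks must synchronize batch metadata before the MLP whenever
        # any of the above applies.
        return server_args.enable_dp_attention or require_gathered_buffer(server_args)

    # Example: DP attention with DP dense FFNs
    # (--enable-dp-attention --moe-dense-tp-size 1)
    args = SimpleNamespace(enable_dp_attention=True, moe_dense_tp_size=1)
    assert require_attn_tp_gather(args) and require_mlp_sync(args)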

@ch-wan changed the title from Fix to Fix All-gather after DP FFNs May 18, 2025
@ch-wan marked this pull request as ready for review May 18, 2025 08:24
@ch-wan requested review from HaiShaw and BBuf as code owners May 27, 2025 09:12
@whybeyoung (Collaborator) commented:

LGTM

@ch-wan changed the title from Fix All-gather after DP FFNs to [Feature] Comprehensive Parallelism Support Jun 20, 2025
@ch-wan changed the title from [Feature] Comprehensive Parallelism Support to [Feature] Comprehensive Hybrid Parallelism Support Jun 20, 2025
@ch-wan (Collaborator, Author) commented Jun 20, 2025

@gemini-code-assist

@gemini-code-assist (Contributor) replied:

Hello @ch-wan! You've tagged me. I'm here to assist with this pull request. Would you like a summary of the changes, or perhaps a re-review if there have been recent updates?

@ch-wan (Collaborator, Author) commented Jun 20, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

The pull request introduces comprehensive hybrid parallelism support by refactoring existing parallelism control flow and fixing compatibility issues between different parallelism flags. It replaces local_attn_dp_size with attn_dp_size for communicator logic and updates conditional logic based on require_mlp_tp_gather. Several suggestions were made to improve code clarity and consistency by removing unused functions and ensuring consistent naming conventions.

@@ -28,9 +28,9 @@
attn_tp_reduce_scatter,
dp_gather_partial,
dp_scatter,
get_attention_dp_size,
get_attention_tp_rank,
get_attention_tp_size,
gemini-code-assist bot (medium):

Consider removing get_local_attention_dp_size since it's no longer used after the change.

@@ -239,7 +239,7 @@
def init_new(cls):
attn_tp_rank = get_attention_tp_rank()
attn_tp_size = get_attention_tp_size()
local_attn_dp_size = get_local_attention_dp_size()
attn_dp_size = get_attention_dp_size()
gemini-code-assist bot (medium):

Use attn_dp_size instead of local_attn_dp_size to initialize the CommunicateContext.

        attn_tp_rank = get_attention_tp_rank()
        attn_tp_size = get_attention_tp_size()
        attn_dp_size = get_attention_dp_size()

@@ -251,7 +251,7 @@
process_group_sizes=process_group_sizes,
attn_tp_rank=attn_tp_rank,
attn_tp_size=attn_tp_size,
local_attn_dp_size=local_attn_dp_size,
attn_dp_size=attn_dp_size,
gemini-code-assist bot (medium):

Use attn_dp_size instead of local_attn_dp_size to initialize the CommunicateContext.

            process_group_sizes=process_group_sizes,
            attn_tp_rank=attn_tp_rank,
            attn_tp_size=attn_tp_size,
            attn_dp_size=attn_dp_size,

Comment on lines +33 to 36
get_attention_dp_rank,
get_attention_dp_size,
get_attention_tp_size,
get_local_attention_dp_rank,
get_local_attention_dp_size,
gemini-code-assist bot (medium):

Consider removing get_local_attention_dp_rank and get_local_attention_dp_size since they are no longer used after the change.

@ch-wan (Collaborator, Author) commented Jun 20, 2025

/gemini summary

@gemini-code-assist (Contributor) replied:

Summary of Changes

This pull request significantly enhances the SGLang framework's support for hybrid parallelism configurations. It addresses existing incompatibilities between various parallelism flags, such as Mixture-of-Experts (MoE), Tensor Parallelism (TP), and Data Parallelism (DP), by refactoring the underlying control flow and buffer management. The changes aim to provide a more robust and flexible system for combining different parallelism strategies, ensuring correct behavior and efficient resource utilization across diverse model architectures and deployment scenarios.

Highlights

  • Refactored Parallelism Control Flow: Introduced new utility functions (require_mlp_sync, require_mlp_tp_gather, require_attn_tp_gather, require_gathered_buffer) to centralize and clarify the logic for managing parallelism-related operations and buffer requirements across the system.
  • Improved Buffer Management: Fixed issues where gathered_buffer was not correctly created for certain MoE and Data Parallelism (DP) configurations, and optimized its usage under pure Tensor Parallelism (TP) setups, leading to more efficient memory allocation.
  • Corrected Communication Logic: Addressed an incorrect control logic in the communicator by consistently using attn_dp_size instead of local_attn_dp_size for data parallelism gathering operations, ensuring correct data synchronization.
  • Generalized Parallelism Preparation: Renamed and updated prepare_dp_attn_batch functions to prepare_mlp_sync_batch across the scheduler and disaggregation components, reflecting a broader scope of synchronization for MLP (Multi-Layer Perceptron) operations.
  • Removed Redundant Flags: Eliminated the enable_sp_layernorm flag from server_args.py, as its functionality is now implicitly handled by the new generalized parallelism control logic, simplifying configuration.
  • Comprehensive Testing: Added 60 new test cases to ensure the robustness and correctness of the updated hybrid parallelism support across various configurations.
Changelog
  • python/sglang/bench_one_batch.py
    • Imported new utility functions: require_mlp_sync and require_mlp_tp_gather (L74-75).
    • Renamed _maybe_prepare_dp_attn_batch to _maybe_prepare_mlp_sync_batch and updated its call sites (L246, L260).
    • Modified _maybe_prepare_mlp_sync_batch to use require_mlp_sync for conditional execution and require_mlp_tp_gather for parameter passing, removing moe_dense_tp_size (L269, L279).
  • python/sglang/srt/disaggregation/decode.py
    • Imported require_mlp_sync (L57).
    • Replaced prepare_dp_attn_flag with prepare_mlp_sync_flag derived from require_mlp_sync (L649).
    • Updated calls to prepare_dp_attn_batch to prepare_mlp_sync_batch (L662, L714, L760, L765).
  • python/sglang/srt/disaggregation/prefill.py
    • Imported require_mlp_sync (L48).
    • Replaced conditional checks involving enable_dp_attention or enable_sp_layernorm with require_mlp_sync (L278, L312).
    • Updated calls to prepare_dp_attn_batch to prepare_mlp_sync_batch (L279, L313).
  • python/sglang/srt/layers/communicator.py
    • Replaced import of get_local_attention_dp_size with get_attention_dp_size (L31).
    • Renamed local_attn_dp_size to attn_dp_size in CommunicateContext class definition (L232) and its initialization (L242, L254).
    • Updated the conditional check from context.local_attn_dp_size != 1 to context.attn_dp_size != 1 (L388).
  • python/sglang/srt/layers/dp_attention.py
    • Modified get_dp_local_info to use get_attention_dp_rank() instead of get_local_attention_dp_rank() (L169).
  • python/sglang/srt/layers/logits_processor.py
    • Removed import of get_local_attention_dp_rank (L35).
    • Modified compute_dp_attention_metadata to use get_attention_dp_rank() instead of get_local_attention_dp_rank() (L174).
  • python/sglang/srt/managers/scheduler.py
    • Imported new utility functions: require_mlp_sync and require_mlp_tp_gather (L151-152).
    • Replaced conditional checks involving enable_dp_attention or enable_sp_layernorm with require_mlp_sync (L1439).
    • Renamed prepare_dp_attn_batch to prepare_mlp_sync_batch (L1750).
    • Renamed prepare_dp_attn_batch_raw to prepare_mlp_sync_batch_raw (L1767).
    • Removed moe_dense_tp_size parameter from prepare_mlp_sync_batch_raw and added require_mlp_tp_gather (L1770, L1779).
    • Updated the gathered_buffer condition from moe_dense_tp_size == 1 and global_server_args_dict["enable_dp_lm_head"] to not require_mlp_tp_gather (L1854).
  • python/sglang/srt/model_executor/cuda_graph_runner.py
    • Imported new utility functions: require_attn_tp_gather, require_gathered_buffer, require_mlp_tp_gather (L49-51).
    • Replaced enable_dp_attention and enable_sp_layernorm attributes with require_gathered_buffer, require_mlp_tp_gather, require_attn_tp_gather (L213-215).
    • Refactored gathered_buffer and global_num_tokens_gpu initialization logic based on require_mlp_tp_gather and require_attn_tp_gather to correctly size the buffers (L306-327).
    • Updated can_run, capture_one_batch_size, and replay_prepare to use the new require_mlp_tp_gather and require_gathered_buffer flags (L339, L476, L489, L634, L669).
  • python/sglang/srt/models/deepseek_v2.py
    • Removed self.dp_size = get_local_attention_dp_size() from DeepseekV2Model (L1624) and DeepseekV2ForCausalLM (L1709).
  • python/sglang/srt/server_args.py
    • Removed self.enable_sp_layernorm attribute and its related logic (L390, L400-402).
  • python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py
    • Imported new utility functions: require_attn_tp_gather, require_gathered_buffer, require_mlp_tp_gather (L23-26).
    • Replaced enable_dp_attention and enable_sp_layernorm attributes with require_gathered_buffer, require_mlp_tp_gather, require_attn_tp_gather (L47-49).
    • Refactored gathered_buffer, global_num_tokens_gpu, and global_num_tokens_for_logprob_gpu initialization logic based on require_mlp_tp_gather and require_attn_tp_gather (L97-117).
    • Updated can_run, capture_one_batch_size, and replay to use the new require_mlp_tp_gather and require_gathered_buffer flags (L129, L167, L191, L291, L318).
  • python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
    • Imported new utility functions: require_attn_tp_gather, require_gathered_buffer, require_mlp_tp_gather (L25-28).
    • Replaced enable_dp_attention and enable_sp_layernorm attributes with require_gathered_buffer, require_mlp_tp_gather, require_attn_tp_gather (L43-45).
    • Refactored gathered_buffer, global_num_tokens_gpu, and global_num_tokens_for_logprob_gpu initialization logic based on require_mlp_tp_gather and require_attn_tp_gather (L101-121).
    • Updated can_run, capture_one_batch_size, and replay to use the new require_mlp_tp_gather and require_gathered_buffer flags (L132, L175, L199, L303, L332).
  • python/sglang/srt/utils.py
    • Added four new utility functions: require_mlp_tp_gather, require_attn_tp_gather, require_gathered_buffer, and require_mlp_sync. These functions encapsulate the logic for determining when specific parallelism-related buffers or synchronization steps are needed based on server_args (L2306-2348). An illustrative call-site sketch follows this changelog.
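
To make the renamed control flow concrete, here is a hedged, self-contained sketch of the token-list gating described for prepare_mlp_sync_batch_raw above; the real function takes more parameters, and the surrounding scheduler code is omitted:

    from types import SimpleNamespace

    def prepare_mlp_sync_batch_raw(local_batch, require_mlp_tp_gather: bool,
                                   num_tokens: int, num_tokens_for_logprob: int):
        # Sketch: when no MLP TP gather is required, each rank keeps only its
        # own token counts. The previous gate was
        # `moe_dense_tp_size == 1 and global_server_args_dict["enable_dp_lm_head"]`.
        if not require_mlp_tp_gather:
            local_batch.global_num_tokens = [num_tokens]
            local_batch.global_num_tokens_for_logprob = [num_tokens_for_logprob]
        return local_batch

    batch = prepare_mlp_sync_batch_raw(SimpleNamespace(), False, 128, 8)
    print(batch.global_num_tokens)  # [128]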
Activity
  • @whybeyoung provided an LGTM approval (Comment 1).
  • @ch-wan tagged the gemini-code-assist[bot] for a review (Comment 2, 4) and then requested a summary (Comment 5).
  • The gemini-code-assist[bot] provided several review comments, primarily suggesting the removal of unused get_local_attention_dp_size and get_local_attention_dp_rank functions, and advising the consistent use of attn_dp_size in CommunicateContext initialization (Review Comment Threads 2159658271, 2159658272, 2159658275, 2159658279).

@zhyncs zhyncs merged commit e879d8b into main Jun 20, 2025
46 of 53 checks passed
@zhyncs zhyncs deleted the cheng/fix-6297 branch June 20, 2025 21:43
Comment on lines 102 to +121
    self.gathered_buffer = torch.zeros(
        (
            self.max_num_token,
            self.model_runner.model_config.hidden_size,
        ),
        dtype=self.model_runner.dtype,
    )
    self.global_num_tokens_gpu = torch.zeros(
        (self.dp_size,), dtype=torch.int32
    )
    self.global_num_tokens_for_logprob_gpu = torch.zeros(
        (self.dp_size,), dtype=torch.int32
    )

    if self.require_mlp_tp_gather:
        self.global_num_tokens_gpu = torch.zeros(
            (self.dp_size,), dtype=torch.int32
        )
        self.global_num_tokens_for_logprob_gpu = torch.zeros(
            (self.dp_size,), dtype=torch.int32
        )
    else:
        assert self.require_attn_tp_gather
        self.global_num_tokens_gpu = torch.zeros((1,), dtype=torch.int32)
        self.global_num_tokens_for_logprob_gpu = torch.zeros(
            (1,), dtype=torch.int32
        )
@Edenzzzz (Contributor) commented Jun 21, 2025

Could we use torch.empty for all these to reduce overhead?

@ch-wan (Collaborator, Author) replied:
It's in init, so the overhead is negligible.
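
For context on the suggestion above, a standalone illustration of the difference (not code from this PR; shapes are arbitrary):

    import torch

    # torch.zeros allocates and zero-fills the buffer; torch.empty only
    # allocates, leaving the contents uninitialized. For buffers created once
    # at __init__ time, the one-time zero-fill cost is negligible, as noted.
    buf_zeroed = torch.zeros((4096, 7168), dtype=torch.bfloat16)  # allocated + filled
    buf_raw = torch.empty((4096, 7168), dtype=torch.bfloat16)     # allocated only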

whybeyoung pushed a commit to whybeyoung/sglang that referenced this pull request Jun 24, 2025
yilian49 pushed a commit to yilian49/sglang that referenced this pull request Jun 24, 2025
chenxijun1029 pushed a commit to chenxijun1029/sglang that referenced this pull request Jul 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Jul 17, 2025
* Use seq_len_fill_value in the cuda graph runners (sgl-project#7233)

* support custom weight loader for model runner (sgl-project#7122)

Co-authored-by: kavioyu <kavioyu@tencent.com>

* Fix AMD speculative decoding (sgl-project#7252)

* [Refactor] OAI Server components (sgl-project#7167)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* OAI Server Skeleton & Core Utility Endpoints (sgl-project#7179)

* [amd] Opt dsv3 moe (sgl-project#7160)

Co-authored-by: wunhuang <wunhuang@amd.com>

* update ci node for xeon (sgl-project#7265)

* feat: mtp support dp-attention (sgl-project#6081)

Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>

* support qwen2 running on ascend npu device (sgl-project#7022)

Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com>

* Fix Deepseek R1 0528 FP4 tensor name mismatch issue during weights loading. (sgl-project#7164)

* bugfix(tool call ebnf): Fix EBNF generation for optional function parameters (sgl-project#7283)

* Fix AWQ Dequant and Weight Loading of deepseek v2 (sgl-project#6842)

* fix: resolve b200 dsv3 mtp issue (sgl-project#7286)

* ci: Fix test_ebnf_generate_all_optional_function_params (sgl-project#7288)

* fix: only enable flash_attn test on sm80 sm90 (sgl-project#7289)

* [PD] Support get local ip from NIC for PD disaggregation (sgl-project#7237)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [PD] Add custom memory pool option to support Mooncake PD with NVLink  (sgl-project#7264)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Upstreaming hicache bug fixes (sgl-project#7267)

* Update python API of activation, topk, norm and rope and remove vllm dependency (sgl-project#6614)

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: sdp <sdp@gnr799219.jf.intel.com>

* Fix hicache benchmark script bug - some sampled input_request is [] (sgl-project#7300)

* chore: change logs from`INFO` to `DEBUG` for dp and add force quit for tokenizer manager (sgl-project#7251)

* update invalid link in doc (sgl-project#7297)

* Fix mini_lb for PD with long output: limit chunk size of decode response (sgl-project#7301)

Signed-off-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com>
Co-authored-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com>

* Fix profiler error when there are idle passes (sgl-project#7003)

* [pd] optimize dockerfile for  pd disaggregation (sgl-project#7319)

Co-authored-by: zhyncs <me@zhyncs.com>

* Merge PDLB (Prefill-Decode Load Balancer) into SGLang Router (sgl-project#7096)

* Add more refactored openai test & in CI (sgl-project#7284)

* fix: resolve blackwell deepep image issue (sgl-project#7331)

* add seed in CPU UTs to avoid flaky failure (sgl-project#7333)

* Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (sgl-project#7099)

* Reintroduce tiny fix sampler error when prob is not contiguous (sgl-project#7354)

* [Refactor] Clean up radix cache related API (sgl-project#7303)

Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* Put `_normalize_rid` before other normalization in `io_struct` (sgl-project#7363)

* [PD] Transfer hidden states for mtp when disaggregation (sgl-project#7242)

* [Bugfix][PD] Set conclude state before clear when failure happens (sgl-project#7362)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* docs: update installation (sgl-project#7366)

* [Docker] optimize dockerfile  remove deepep and blackwell merge it to… (sgl-project#7343)

Co-authored-by: Yineng Zhang <me@zhyncs.com>

* Clean unused import for mimo mtp model (sgl-project#7370)

* [Bugfix]Fix hang bug using dp attention with HiRadixCache (sgl-project#7159)

Signed-off-by: huanglong <huanglong@linux.alibaba.com>

* [Doc] add embedding rerank doc (sgl-project#7364)

* Fix judgment condition for enabling Deepseek V3/R1 shared expert fusion optimization (sgl-project#7371)

* Feat/refactor embedding server (sgl-project#7322)

* Purge VerlEngine (sgl-project#7326)

Signed-off-by: Ata Fatahi <immrata@gmail.com>

* support return logprobs for pipeline (sgl-project#7356)

Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>

* [PD] Optimize custom mem pool usage and bump mooncake version (sgl-project#7393)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Support THUDM/GLM-4-0414 (GLM-Z1) Glm4ForCausalLM architecture. (sgl-project#5485)

* Refine OpenAI serving entrypoint to remove batch requests (sgl-project#7372)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Chang Su <csu272@usc.edu>

* [Feature] Comprehensive Hybrid Parallelism Support (sgl-project#6389)

* [DeepSeekNextN] fix: residual of head norm can be None (sgl-project#7398)

* [OAI refactor] Add rerank and score serving (sgl-project#7399)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* [OAI Server Refactor] [ChatCompletions & Completions] Implement UsageInfo Processor (sgl-project#7360)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* Fix All-Gather under world size one (sgl-project#7219)

* Optimize DP attn scheduling for speculative decoding (sgl-project#7285)

* Update usage_processor.py (sgl-project#7402)

* Fix 7285 Merge Conflicts (sgl-project#7403)

* chore: upgrade mooncake-transfer-engine 0.3.4 (sgl-project#7401)

* [OAI Server Refactor] [ChatCompletions & Completions] Support Return Hidden State (sgl-project#7329)

Signed-off-by: keru <rukeyang@gmail.com>

* Remove batches api in docs & example (sgl-project#7400)

* [BugFix]: fix EmbeddingReqInput single input error (sgl-project#7396)

* [BugFix]fix qwen25 invoke function call streaming responses with curly braces as the starting indicator (sgl-project#7394)

* fix overlap pagecount (sgl-project#6984)

Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* fix: Fix CI test_function_call_parser.py (sgl-project#7425)

* Fix CPU offloading for MLA memory pool (sgl-project#7409)

* [fix] PD disaggregation when enable mtp and tp!=dp (sgl-project#7420)

* feat(oai refactor): Replace `openai_api` with `entrypoints/openai`  (sgl-project#7351)

Co-authored-by: Jin Pan <jpan236@wisc.edu>

* Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support (sgl-project#7412)

* refactor(test): reorganize OpenAI test file structure (sgl-project#7408)

* [minor] simplify the `TokenToKVPoolAllocator` (sgl-project#7414)

* Tiny add logging for GC  (sgl-project#7406)

* FlashInfer NVFP4 MoE with EP & 2-stream shared expert (sgl-project#7327)

Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: alcanderian <alcanderian@gmail.com>

* Remove copy after bmm (sgl-project#7441)

* Fix torch compile run (sgl-project#7391)

Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Sai Enduri <saimanas.enduri@amd.com>

* [misc] Add PD service discovery support in router (sgl-project#7361)

* add fused moe config for qwen3 in triton3.3.1 (sgl-project#7445)

* Fix CUDA Graph Check under Deepep with DP FFN (sgl-project#7451)

* Update hyperparameter_tuning.md (sgl-project#7454)

* feat: integrate deepgemm into EPMoE (sgl-project#6821)

Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: TianQiLin666666 <1834987979@qq.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

* Solve docker build failed in the virtual machine (sgl-project#7290)

Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Sai Enduri <saimanas.enduri@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>

* Fix a bug in BatchTokenIDOut & Misc style and dependency updates (sgl-project#7457)

* [CI] Upgrade mooncake to 0.3.4.post1 to fix 8 gpu tests (sgl-project#7472)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Fix prefill OOM due to wrong token calculation when page > 1  (sgl-project#7397)

* feat(func_call): Add more check in `BaseFormatDetector.parse_streaming_increment` (sgl-project#7479)

* Fix dtype for idle input in spec decoding (sgl-project#7456)

* update mooncake in dockerfile (sgl-project#7480)

* kvcache io kernels and test case (sgl-project#7382)

* [perf] slightly imporve DeepSeek-R1-FP4 TP8 (sgl-project#7481)

* Quick fix for DeepGemm requant to also cover MTP. (sgl-project#7378)

* Support weight loading without mmap (sgl-project#7469)

* ci: Revert openai_server related tests in AMD suites (sgl-project#7449)

* Perormance: Enable cuda graph for dp idle batch (sgl-project#7269)

Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>

* bugfix: Prevent global mutation of conv.stop_str across requests (sgl-project#7347)

Co-authored-by: Chang Su <chang.s.su@oracle.com>

* Fix RequestValidationError response format (sgl-project#7487)

* Fix MTP with Deepseek R1 Fp4 (sgl-project#7376)

* chore: bump sgl-kernel v0.2.0 (sgl-project#7490)

* chore: bump v0.4.8 (sgl-project#7493)

* [AMD] add aiter fused moe in DeepEP path (sgl-project#7268)

* enable aiter_biased_grouped_topk kernel (sgl-project#7423)

* [PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)

* Remove cumsum_buffer initilization (sgl-project#7439)

* [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm (sgl-project#7422)

* Support multi-thread model weight loading (sgl-project#7277)

* [PD] NIXL: Register kv args in advance and cleanup finished requests (sgl-project#6717)

* fix: Add `--model` as an alias for `--model-path` in server_args (sgl-project#7505)

* misc: Improvement to serving_chat.py and add more ut (sgl-project#7489)

* Fuse sorted_token_ids padding to moe_align_block_size kernel (sgl-project#7437)

* [OAI] patch origin request_id logic (sgl-project#7508)

* [PD][Spec] Fix hidden state transfer for spec decode (sgl-project#7516)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* EPLB support for MTP (sgl-project#7510)

* clean duplicate code (sgl-project#7512)

* [ci] add router benchmark script and CI (sgl-project#7498)

* fix: force synchronization between TP workers when update_weights (sgl-project#6626)

Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com>

* [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model (sgl-project#6641)

Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>

* [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug (sgl-project#7522)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* npu fused op (sgl-project#7386)

Co-authored-by: Li Junwen <lijunwen13@hisilicon.com>

* feat: send kvmetrics from sglang scheduler (sgl-project#6721)

* [PD] Add different TP sizes support for no-MLA models (sgl-project#6793)

Co-authored-by: shangmingc <csmthu@gmail.com>
Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com>

* enable aiter fp8 blockscale quant (sgl-project#7520)

* take aiter get_rope back (sgl-project#7521)

* Fix typo of flash_cache (sgl-project#7513)

* feat: add return hidden_states at async generation (sgl-project#7507)

* minor: 'role' must be system/assistant/tool, but case insensitive for now (sgl-project#7499)

* Fix FP8 KV Cache Support in FA3 Backend (sgl-project#7148)

* Fix gathered_buffer issues in tbo (sgl-project#7531)

* [PD] Raise error for incompatible mooncake version and some minor fixes (sgl-project#7527)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [CMake] Fix sgl-kernel CMakeLists for Blackwell (sgl-project#7543)

* Add Tencent HunYuanMoEV1 model support (sgl-project#7549)

* Update seed in CPU UTs to avoid flaky failure with single test (sgl-project#7544)

* chore: improve ci bug reporting (sgl-project#7542)

* chore: remove vlm unnecessary import (sgl-project#7541)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* chore: bump v0.4.8.post1 (sgl-project#7559)

* [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND (sgl-project#7330)

* [Fix] incorrect assert in EPLB (sgl-project#7575)

* Updates Gemma3n MLP layer to adapt latest transformers version (sgl-project#7573)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* Fix MTP error when enabling two-batch overlap  (sgl-project#7569)

* Add e2e test for multi instance multi stage memory release/resume occupuation (sgl-project#7208)

Signed-off-by: Ata Fatahi <immrata@gmail.com>

* [CI] Add CI Testing for Prefill-Decode Disaggregation with Router (sgl-project#7540)

* Updates transformers and timm dependencies (sgl-project#7577)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* feat: support compatibility between MTP and two-batch-overlap (sgl-project#7225)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>

* Move multimodal processors into a separate folder (sgl-project#7581)

* Fix broken CI TestVILAServer (sgl-project#7610)

* [router] add centralized configuration module for sgl-router (sgl-project#7588)

* Fix: Minicpm (sgl-project#7612)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* Hybrid kv cache for LLaMA4 (sgl-project#6563)

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>

* [CPU] add optimizations for INT8 and FP8 DeepSeek (sgl-project#6769)

Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>

* Tiny add logs for expert location updater (sgl-project#7308)

* Fix flakiness in LoRA batch test. (sgl-project#7552)

* [BUG] fix local_rank in initialize_dp_attention (sgl-project#7584)

* Support dynamic LoRA loading / unloading in engine/server API (sgl-project#7446)

* [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated (sgl-project#7598)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* fix unit tests (sgl-project#7618)

* Let ep_scatter support arbitrary strides / ue8m0 format (sgl-project#7309)

* Let EP prefill support new DeepGEMM (sgl-project#7310)

* docs: add gb200 nvl72 and a16z grant (sgl-project#7620)

* oai: Adds support for OpenAI chat completions API in bench_serving (sgl-project#7036)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>

* [bugfix] Remove PR comment posting from Rust benchmark workflow (sgl-project#7625)

* [Minor] clean up multimodal processor and tokenizer manager (sgl-project#7624)

* Add dsv3 fused a gemm to sgl-kernel (sgl-project#7630)

* Add @mickqian as the CODEOWNERS of multimodal (sgl-project#7636)

* Fix stream reasoning parser and Adds Kimi reasoning parser  (sgl-project#7432)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* Fix sgl-router startup crash (sgl-project#7619)

* [bugfix] fix runtime dropping panic in editable (sgl-project#7628)

* Move files related to EPLB (sgl-project#7580)

* [misc] reduce weird rope_scaling_factor warning (sgl-project#7176)

* [AMD] Add unit-test-sgl-kernel-amd to AMD CI (sgl-project#7539)

* Update CODEOWNERS (sgl-project#7640)

* [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py (sgl-project#7643)

* [CPU] add c++ kernel to bind CPU cores and memory node (sgl-project#7524)

* Improve streaming, log_level, memory report, weight loading, and benchmark script (sgl-project#7632)

Co-authored-by: Kan Wu <wukanustc@gmail.com>

* Add dsv3 router gemm kernel (sgl-project#7627)

* chore: upgrade flashinfer v0.2.7 jit (sgl-project#7663)

* [doc] update lws doc for pd (sgl-project#7318)

* Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes (sgl-project#7648)

* Add small requirements for benchmark/parse_result tools (sgl-project#7671)

* [CPU] remove process_group from inputs of shm_allreduce and shm_allgather (sgl-project#7486)

* chore: bump sgl-kernel v0.2.1 (sgl-project#7675)

* support llama4 eagle3  (sgl-project#6985)

Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Shenggui Li <somerlee.9@gmail.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>

* Refactor mm processors and Enable mixed modality processing (sgl-project#7629)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* upgrade sgl kernel to 0.2.1 for main (sgl-project#7676)

* add description for llama4 eagle3 (sgl-project#7688)

* fix(model loader): use safe_open to prevent file handle leaks. (sgl-project#7684)

* chore: upgrade flashinfer v0.2.7.post1 (sgl-project#7698)

* Improve error handling for requests with unloaded LoRA path(s) (sgl-project#7642)

* Apply dsv3_fused_a_gemm kernel (sgl-project#7635)

* Fix GPTQMarlinMoE (sgl-project#7697)

* [1/n] apply wna16marlin kernel in moe weight only quantization (sgl-project#7683)

Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: 弋云 <yiyun.wyt@antgroup.com>
Co-authored-by: walker-ai <2398833647@qq.com>

* Apply dsv3 router gemm kernel for deepseek-r1 fp4 (sgl-project#7677)

* [AMD] Temporarily disable test_no_overlap_scheduler and test_vision_chunked_prefill (sgl-project#7717)

* [RL] add --skip-warmup (sgl-project#7416)

* [RL] support update_weights_from_distributed with different group and multiple weights (sgl-project#7292)

* [router] add --log-level to sgl-router (sgl-project#6512)

* [b200] support trt-llm allreduce fuse rms_norm_add kernel (sgl-project#7621)

* [CPU] Bind threads and numa node for each TP rank (sgl-project#6549)

Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* Support non-contiguous query input for extend/decode attention (sgl-project#7462)

* Support updating weights at once by stopping all requests (sgl-project#6698)

Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com>
Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com>

* Fix num_tokens_pre_allocated in disaggregation log (sgl-project#7714)

* [CPU] [sgl-kernel] set dispatch key of initialize to CatchAll (sgl-project#7734)

* [CPU] fix all_reduce and all_gather (sgl-project#6770)

Co-authored-by: blzheng <beilei.zheng@intel.com>

* fix awq and dsv3 fused gemm compatible (sgl-project#7735)

* [CI][Router] Fix bench_one_batch_server for pd router test (sgl-project#7731)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (sgl-project#7278)

Co-authored-by: HydraQYH <QYH820@Outlook.com>
Co-authored-by: TianQiLin666666 <1834987979@qq.com>

* fix dsv3 fused proj check  (sgl-project#7738)

* Ascend attention backend(PA&MLA) (sgl-project#7722)

Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: VDV1985 <vladdv85@mail.ru>

* [fix] fix dsv3_router_gemm filter (sgl-project#7750)

* [CPU] refine CPU integration code (sgl-project#7647)

* [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size (sgl-project#6771)

* support qwen3 dense model dp attention (sgl-project#7681)

* [optimize] add two stream norm for qwen3 (sgl-project#7740)

Co-authored-by: ispobock <ispobaoke@gmail.com>

* feat: use D2D instead of H2H in pp (sgl-project#7673)

Co-authored-by: alpha-baby <fujianhao1997@qq.com>

* [Bug] add flashinfer bool check for fusedmoe in Qwen moe models (sgl-project#7723)

* [fix] put cpu in the first priority in get_device() (sgl-project#7752)

* [optimize] fuse renormalize into moe_topk_softmax (sgl-project#7744)

Co-authored-by: ispobock <ispobaoke@gmail.com>

* chore: bump sgl-kernel 0.2.2 (sgl-project#7755)

* fix CI: update native api ipynb (sgl-project#7754)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* fuse renormal into moe topk softmax kernel python code (sgl-project#7751)

Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: zhyncs <me@zhyncs.com>

* Remove type conversion and fix id map in topk (sgl-project#7759)

* Add V2-lite model test (sgl-project#7390)

Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>

* refactor llama4 dp attention logic (sgl-project#7729)

* fix(docs): fix the broken link in `docs/references/production_metrics.md` (sgl-project#7741)

Signed-off-by: rudeigerc <rudeigerc@gmail.com>

* [fix] update bench_speculative.py for compatibility (sgl-project#7764)

Signed-off-by: Kay Yan <kay.yan@daocloud.io>

* Move mem_fraction_static adjustment for multimodal models to `server_args.py` & Fix session control & Other cleanups (sgl-project#7748)

* [RL] Add --nccl-port to prevent port conflict (sgl-project#7418)

* [RL] add pause and continue generation for async rl training (sgl-project#7419)

* [Fix] Alloc return type error (sgl-project#7778)

Signed-off-by: Capronir <839972205@qq.com>

* [feat] Support EAGLE3 for Qwen (sgl-project#7745)

Co-authored-by: 纬杭 <ximing.wxm@antgroup.com>
Co-authored-by: zyksir <zyksir@outlook.com>

* saving hidden_states.clone() (sgl-project#7705)

* [1/n]: add cutlass W4A8 moe kernel for hopper architecture (sgl-project#7772)

Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com>
Co-authored-by: yicwang <yichen.wang@bytedance.com>

* add model: qwen2-audio (sgl-project#7596)

* Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario (sgl-project#7782)

* Embedding parallel by attn_tp (sgl-project#7623)

* fix: fix apply_shuffle_mul_sum (sgl-project#7444)

* chore: bump sgl-kernel v0.2.3 (sgl-project#7784)

* fix: use nvidia-nccl-cu12 2.27.5 (sgl-project#7787)

* DP Attention with Auto DeepEP Dispatch (sgl-project#7222)

* chore: upgrade sgl-kernel v0.2.3 (sgl-project#7786)

* Fix incorrect spec_num_draft_tokens in draft_extend (sgl-project#7757)

* [fix] fix misusing of is_cuda (sgl-project#7790)

* Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (sgl-project#7756)

Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com>

* chore: bump sgl-kernel v0.2.4 (sgl-project#7800)

* ci: fix port args (sgl-project#7792)

* Fix CI test OOM issue. (sgl-project#7799)

* chore: upgrade sgl-kernel v0.2.4 (sgl-project#7801)

* chore: bump v0.4.9 (sgl-project#7802)

* fix merge conflict issue

* fix hpu attention nonetyep issue

* fix alignment

* fix alignment2

* Ci failure fixes

* fix attention-backend choices

---------

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com>
Signed-off-by: huanglong <huanglong@linux.alibaba.com>
Signed-off-by: Ata Fatahi <immrata@gmail.com>
Signed-off-by: keru <rukeyang@gmail.com>
Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com>
Signed-off-by: rudeigerc <rudeigerc@gmail.com>
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Signed-off-by: Capronir <839972205@qq.com>
Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: KavioYu <67678385+yukavio@users.noreply.github.com>
Co-authored-by: kavioyu <kavioyu@tencent.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com>
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>
Co-authored-by: u4lr451 <u4lr451@gmail.com>
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
Co-authored-by: Yijie Zhu <762412795@qq.com>
Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com>
Co-authored-by: Charles Chen <pychen96@gmail.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: YanbingJiang <yanbing.jiang@intel.com>
Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: sdp <sdp@gnr799219.jf.intel.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: linzhuo <15313137931lz@gmail.com>
Co-authored-by: ch-tiger1 <tiger@ch-tech.ip-ddns.com>
Co-authored-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com>
Co-authored-by: Atream <80757050+Atream@users.noreply.github.com>
Co-authored-by: Li Hui <lambert80.ios@gmail.com>
Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com>
Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com>
Co-authored-by: Ata Fatahi <immrata@gmail.com>
Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com>
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Co-authored-by: Wenbo Yang <solrex@users.noreply.github.com>
Co-authored-by: Chang Su <csu272@usc.edu>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Keyang Ru <rukeyang@gmail.com>
Co-authored-by: ehuaa <ehuamail@163.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Jin Pan <jpan236@wisc.edu>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: alcanderian <alcanderian@gmail.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: Sai Enduri <saimanas.enduri@amd.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: xutizhou <xutingz@nvidia.com>
Co-authored-by: TianQiLin666666 <1834987979@qq.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Yuhong Guo <guoyuhong1985@outlook.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: Alex Sun <alex.s@amd.com>
Co-authored-by: valarLip <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: xianzhiT <xianzhitang@tencent.com>
Co-authored-by: yilian49 <43861414+yilian49@users.noreply.github.com>
Co-authored-by: DangKai <dangkai4u@outlook.com>
Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Co-authored-by: ll819214 <18801269230@163.com>
Co-authored-by: Li Junwen <lijunwen13@hisilicon.com>
Co-authored-by: zixuanzhang226 <zixuanzhang@bytedance.com>
Co-authored-by: Hongbo Xu <1320612015@qq.com>
Co-authored-by: shangmingc <csmthu@gmail.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Meng, Peng <pengmeng@tencent.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com>
Co-authored-by: tarinkk <rt572@physics.rutger.edu>
Co-authored-by: tarinkk <rt572@rutgers.physics.edu>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
Co-authored-by: Sheng Qi <shengqi2018@pku.edu.cn>
Co-authored-by: finetune <82650881+finetunej@users.noreply.github.com>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: Kan Wu <wukanustc@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: narutolhy <582909902@qq.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Shenggui Li <somerlee.9@gmail.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: Simon_CQK <cqk0100@gmail.com>
Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: 弋云 <yiyun.wyt@antgroup.com>
Co-authored-by: walker-ai <2398833647@qq.com>
Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com>
Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
Co-authored-by: Albert <albert.zty@antgroup.com>
Co-authored-by: Ziming Huang <1520787127@qq.com>
Co-authored-by: ayrnb <70835312+ayrnb@users.noreply.github.com>
Co-authored-by: HydraQYH <QYH820@Outlook.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: Maksim <makcum888e@mail.ru>
Co-authored-by: VDV1985 <vladdv85@mail.ru>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: TianyuZhang1214 <tianyuzhang1214@163.com>
Co-authored-by: alpha-baby <fujianhao1997@qq.com>
Co-authored-by: Yuchen Cheng <rudeigerc@gmail.com>
Co-authored-by: Kay Yan <kay.yan@daocloud.io>
Co-authored-by: Caproni <40862361+Capronir@users.noreply.github.com>
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com>
Co-authored-by: 纬杭 <ximing.wxm@antgroup.com>
Co-authored-by: zyksir <zyksir@outlook.com>
Co-authored-by: SijiaYang <yangsijia.614@bytedance.com>
Co-authored-by: yicwang <yichen.wang@bytedance.com>
Co-authored-by: Leng Yue <lengyue@lengyue.me>
Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com>
Co-authored-by: Gang Chen <13298548+MoonBall@users.noreply.github.com>
Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com>
Co-authored-by: jay <jthakur@habana.ai>
Development

Successfully merging this pull request may close these issues.

[Bug] Cannot run DeepSeek R1 + EPMOE w/ or w/o dp attention(--dp-size 8 --enable-dp-attention --moe-dense-tp-size 1)