feat: mtp support dp-attention #6081

u4lr451 · 2025-05-07T08:40:09Z

Motivation

mtp support dp-attention

Implemented MTP for DP attention and fixed the related bugs "[Bug] DP + MTP init failed with deepseek r1 #4783" and "[Bug] DP attention with Eagle worker raises AttributeError #4847".
Enabled CUDA graph support for both the target and draft models under DP attention.
Performance optimizations: refined gathered_buffer memory allocation during MTP with DP attention. This eliminates a redundant GPU allocation (previously scaled by the --speculative-num-draft-tokens multiplier) and avoids unnecessary memory usage during inference in both the target and draft models. It benefits DP concurrency by reducing memory contention, and it decreases all_reduce communication overhead.
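The memory effect described above can be made concrete with a back-of-the-envelope calculation. The sketch below is not the PR's code: the helper name `gathered_buffer_bytes`, the token budget of 1024, and the 2-byte (bf16) element size are assumptions chosen for illustration. Only the hidden size of 7168 (DeepSeek-V3) and the multiplier of 4 (the --speculative-num-draft-tokens value used in the benchmark command in this PR) are taken from the model and the PR.

```python
def gathered_buffer_bytes(num_tokens, hidden_size, dtype_bytes=2, multiplier=1):
    """Bytes needed for a [num_tokens * multiplier, hidden_size] buffer.

    Hypothetical helper for illustration only; it does not mirror the
    actual allocation code in SGLang.
    """
    return num_tokens * multiplier * hidden_size * dtype_bytes


HIDDEN_SIZE = 7168        # DeepSeek-V3 hidden size
NUM_TOKENS = 1024         # assumed token budget, for illustration
NUM_DRAFT_TOKENS = 4      # --speculative-num-draft-tokens in the benchmark

# Allocation scaled by the draft-token multiplier (the behavior described
# as "previously" in the PR text).
before = gathered_buffer_bytes(NUM_TOKENS, HIDDEN_SIZE, multiplier=NUM_DRAFT_TOKENS)
# Allocation without the multiplier.
after = gathered_buffer_bytes(NUM_TOKENS, HIDDEN_SIZE)

print(f"before: {before / 2**20:.1f} MiB, after: {after / 2**20:.1f} MiB")
```

Under these assumptions the buffer shrinks by a factor equal to the draft-token count; the real saving depends on the actual buffer shapes and on how many buffers are allocated.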

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

Accuracy

Benchmark
Accuracy is measured with the MMLU benchmark:

python3 bench_sglang.py  --data_dir data --nsub 20

Baseline

#two node 
export SGL_ENABLE_JIT_DEEPGEMM=0
python3 -m sglang.launch_server --model-path /sgl-workspace//DeepSeek-V3-0324 --dist-init-addr ${HOST_IP}:20000 --nnodes 2 --node-rank ${RANK}  --trust-remote-code --served-model-name DeepSeek-V3-0324 --context-length 65536 --tensor-parallel-size 16 --stream-output --host 0.0.0.0 --port 30000 --watchdog-timeout 240 --disable-radix-cache --schedule-policy fcfs --chunked-prefill-size 32768 --max-running-requests 24 --disable-overlap-schedule --attention-backend flashinfer --enable-metrics --log-requests

Average accuracy: 0.887

mtp with dp-attention

#two node 
export SGL_ENABLE_JIT_DEEPGEMM=0
python3 -m sglang.launch_server --model-path /sgl-workspace//DeepSeek-V3-0324 --dist-init-addr ${HOST_IP}:20000 --nnodes 2 --node-rank ${RANK} --trust-remote-code --served-model-name DeepSeek-V3-0324 --context-length 65536 --tensor-parallel-size 16 --stream-output --host 0.0.0.0 --port 30000 --watchdog-timeout 240 --disable-radix-cache --schedule-policy fcfs --chunked-prefill-size 32768 --max-running-requests 24 --disable-overlap-schedule --attention-backend flashinfer --disable-cuda-graph-padding --mem-fraction-static 0.60 --speculative-algo NEXTN --speculative-draft /sgl-workspace/SGLang/DeepSeek-V3-0324-NextN --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-metrics --log-requests --disable-cuda-graph --enable-nan-detection --enable-dp-attention --dp-size 8

Average accuracy: 0.887

python/sglang/srt/models/deepseek_nextn.py

lambert0312 · 2025-05-08T07:57:14Z

After testing, the error is as follows:

Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 314, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 405, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 523, in capture_one_batch_size
    torch.cuda.synchronize()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 985, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2266, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 272, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 85, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 190, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 239, in initialize
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1025, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 316, in __init__
    raise Exception(
Exception: Capture cuda graph failed: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable cuda graph by --disable-cuda-graph. (Not recommonded. Huge perf loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

With --disable-cuda-graph set, the server starts normally.

u4lr451 · 2025-05-08T17:08:09Z

@lambert0312 The latest commit (#5256543) has fixed this bug. Thanks!

lambert0312 · 2025-05-08T23:29:55Z

@u4lr451 Great, it has been verified to work properly, but the speed is much slower than when dp-attention is not enabled. Why is this?

python/sglang/srt/speculative/eagle_utils.py

u4lr451 · 2025-05-09T08:44:51Z

@lambert0312 The choice between pure TP and DP attention depends on multiple factors, such as the GPU model, request batch size/concurrency, DP parameters, and business SLA requirements.
Additionally:

MTP itself increases memory and compute overhead. The speedup of DP attention + MTP depends on several factors: a) the MTP acceptance rate, b) the workload, including batch size/concurrency, and c) request balancing across DP workers.
When using DP attention + MTP, the optimal capture_bs for CUDA graphs may differ.

u4lr451 · 2025-05-12T16:39:32Z

@ch-wan @fzyzcjy @merrymercy @zhyncs hi, would someone mind checking if this is ready to merge? Thanks!

zhangxiaolei123456 · 2025-05-13T02:41:04Z

With DP attention, MTP, and CUDA graph all enabled, performance drops sharply. On analysis, the cause is that the acceptance rate drops sharply, which in turn lowers throughput.
With CUDA graph disabled:
accept len: 3.69, gen throughput (token/s): 137.83

[2025-05-13 02:32:12 DP5 TP5] Decode batch. #running-req: 8, #token: 35785, token usage: 0.27, accept len: 3.64, gen throughput (token/s): 135.85, #queue-req: 0
[2025-05-13 02:32:12 DP6 TP6] Decode batch. #running-req: 8, #token: 35896, token usage: 0.27, accept len: 3.62, gen throughput (token/s): 135.27, #queue-req: 0
[2025-05-13 02:32:12 DP7 TP7] Decode batch. #running-req: 8, #token: 36092, token usage: 0.27, accept len: 3.69, gen throughput (token/s): 137.83, #queue-req: 0

With CUDA graph enabled:
accept len: 2.12, gen throughput (token/s): 83.78

[2025-05-13 01:57:33 DP1 TP1] Decode batch. #running-req: 8, #token: 29915, token usage: 0.22, accept len: 1.55, gen throughput (token/s): 61.31, #queue-req: 0
[2025-05-13 01:57:37 DP0 TP0] Decode batch. #running-req: 8, #token: 30629, token usage: 0.23, accept len: 2.12, gen throughput (token/s): 83.78, #queue-req: 0
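To compare the two runs beyond single sample lines, the "Decode batch." entries can be averaged. The sketch below parses the log lines quoted above with a regular expression; the helper name `summarize` and the derived `throughput_per_accept` ratio are illustrative assumptions, not part of SGLang.

```python
import re
from statistics import mean

# Matches the two fields of interest in a "Decode batch." log line.
DECODE_RE = re.compile(
    r"accept len: (?P<accept>[\d.]+), gen throughput \(token/s\): (?P<tput>[\d.]+)"
)


def summarize(log_text):
    """Average accept length and throughput over all matching log lines."""
    rows = [(float(m["accept"]), float(m["tput"])) for m in DECODE_RE.finditer(log_text)]
    if not rows:
        raise ValueError("no decode-batch lines found")
    accept = mean(a for a, _ in rows)
    tput = mean(t for _, t in rows)
    return {"accept_len": accept, "throughput": tput, "throughput_per_accept": tput / accept}


GRAPH_DISABLED = """
[2025-05-13 02:32:12 DP5 TP5] Decode batch. #running-req: 8, #token: 35785, token usage: 0.27, accept len: 3.64, gen throughput (token/s): 135.85, #queue-req: 0
[2025-05-13 02:32:12 DP6 TP6] Decode batch. #running-req: 8, #token: 35896, token usage: 0.27, accept len: 3.62, gen throughput (token/s): 135.27, #queue-req: 0
[2025-05-13 02:32:12 DP7 TP7] Decode batch. #running-req: 8, #token: 36092, token usage: 0.27, accept len: 3.69, gen throughput (token/s): 137.83, #queue-req: 0
"""

GRAPH_ENABLED = """
[2025-05-13 01:57:33 DP1 TP1] Decode batch. #running-req: 8, #token: 29915, token usage: 0.22, accept len: 1.55, gen throughput (token/s): 61.31, #queue-req: 0
[2025-05-13 01:57:37 DP0 TP0] Decode batch. #running-req: 8, #token: 30629, token usage: 0.23, accept len: 2.12, gen throughput (token/s): 83.78, #queue-req: 0
"""

disabled = summarize(GRAPH_DISABLED)
enabled = summarize(GRAPH_ENABLED)
print(disabled)
print(enabled)
```

On these few lines, throughput divided by accept length comes out similar in both runs (roughly 37 versus 40), which is consistent with the drop in throughput being driven mainly by the lower accept length. The sample is small, so this is only indicative.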

@u4lr451

…upport_dp_attention

MiterV1 · 2025-06-16T10:05:40Z

bugs:
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [208,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [208,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [208,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [208,0,0], thread: [67,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [208,0,0], thread: [68,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [208,0,0], thread: [69,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [208,0,0], thread: [70,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [208,0,0], thread: [71,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [208,0,0], thread: [72,0,0] Assertion srcIndex < srcSelectDimSize failed.
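The assertion `srcIndex < srcSelectDimSize` comes from PyTorch's index-select CUDA kernel and fires when an index is not smaller than the size of the dimension being indexed. The report does not say which tensor is involved, so the following is only a debugging sketch: a hypothetical pure-Python helper that, given indices copied to the host and the size of the indexed dimension, lists the offending positions.

```python
def find_out_of_range(indices, dim_size):
    """Return (position, index) pairs that fall outside [0, dim_size).

    Hypothetical debugging helper; `indices` is any iterable of ints,
    for example a tensor converted to a Python list on the host.
    """
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not 0 <= idx < dim_size
    ]


# Example with made-up values: a dimension of size 10 and two bad indices.
bad = find_out_of_range([0, 5, 9, 10, -1], dim_size=10)
print(bad)
```

Running such a check just before the failing lookup, together with CUDA_LAUNCH_BLOCKING=1 so that the error is reported at the offending call, can help narrow down where the invalid indices are produced.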

…upport_dp_attention

…ist" This reverts commit 3f686b1.

…51/6081

ch-wan

Thank you for this excellent contribution. It represents a major optimization for boosting the throughput of DeepSeek-V3/R1, with its correctness and effectiveness verified by many contributors and users from the community. The current implementation looks solid to me.

For future PRs, consider these remaining optimizations:

Enabling CUDA graphs for idle batches during verify or draft_after_decode. This was previously implemented but reverted by me to unblock merging this PR.
Migrating DP attention support to #6995. The current setup requires capturing 3 CUDA graphs and creating 3 gathered_buffers, which consumes unnecessary memory.
Reducing scheduling overhead. The current approach may invoke all_gather_into_tensor twice to check for idle batches, potentially lowering end-to-end throughput in some scenarios.

Xuweijia-buaa · 2025-06-17T07:22:59Z

When I use the following arguments for the DeepSeek-R1 model, without a separate draft model, as described here:
https://docs.sglang.ai/references/deepseek.html#multi-token-prediction

--speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2

the following error is raised:
AttributeError: 'DeepseekModelNextN' object has no attribute 'layers'

Do you know why this happens and how to fix it?

The complete logs are:
  File "sglang/python/sglang/srt/managers/scheduler.py", line 2576, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "sglang/python/sglang/srt/managers/scheduler.py", line 325, in __init__
    self.draft_worker = EAGLEWorker(
  File "sglang/python/sglang/srt/speculative/eagle_worker.py", line 124, in __init__
    super().__init__(
  File "sglang/python/sglang/srt/managers/tp_worker.py", line 78, in __init__
    self.model_runner = ModelRunner(
  File "sglang/python/sglang/srt/model_executor/model_runner.py", line 215, in __init__
    self.initialize(min_per_gpu_memory)
  File "sglang/python/sglang/srt/model_executor/model_runner.py", line 256, in initialize
    self.load_model()
  File "sglang/python/sglang/srt/model_executor/model_runner.py", line 550, in load_model
    self.model = get_model(
  File "sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "sglang/python/sglang/srt/model_loader/loader.py", line 516, in load_model
    model.post_load_weights()
  File "sglang/python/sglang/srt/models/deepseek_v2.py", line 1784, in post_load_weights
    self.model.layers[layer_id].self_attn
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1940, in __getattr__
    raise AttributeError(
AttributeError: 'DeepseekModelNextN' object has no attribute 'layers'

I am also using:
--load-format dummy

Co-authored-by: austindeng <austindeng@tencent.com> Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: ch-wan <cwan39@gatech.edu>

…narios (sgl-project#6081)

ch-wan · 2025-07-08T08:29:45Z

@Xuweijia-buaa see this: #7506

Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: ch-wan <cwan39@gatech.edu> Co-authored-by: Yijie Zhu <762412795@qq.com> Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com> Co-authored-by: Charles Chen <pychen96@gmail.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: YanbingJiang <yanbing.jiang@intel.com> Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com> Co-authored-by: jianan-gu <jianan.gu@intel.com> Co-authored-by: sdp <sdp@gnr799219.jf.intel.com> Co-authored-by: Binyao Jiang <byjiang1996@gmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: linzhuo <15313137931lz@gmail.com> Co-authored-by: ch-tiger1 <tiger@ch-tech.ip-ddns.com> Co-authored-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: Simo Lin <linsimo.mark@gmail.com> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com> Co-authored-by: Atream <80757050+Atream@users.noreply.github.com> Co-authored-by: Li Hui <lambert80.ios@gmail.com> Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com> Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com> Co-authored-by: Ata Fatahi <immrata@gmail.com> Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com> Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> Co-authored-by: Wenbo Yang <solrex@users.noreply.github.com> Co-authored-by: Chang Su <csu272@usc.edu> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Keyang Ru 
<rukeyang@gmail.com> Co-authored-by: ehuaa <ehuamail@163.com> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Lifu Huang <lifu.hlf@gmail.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: alcanderian <alcanderian@gmail.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Sai Enduri <saimanas.enduri@amd.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: xutizhou <xutingz@nvidia.com> Co-authored-by: TianQiLin666666 <1834987979@qq.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Yuhong Guo <guoyuhong1985@outlook.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: Alex Sun <alex.s@amd.com> Co-authored-by: valarLip <103567126+valarLip@users.noreply.github.com> Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: xianzhiT <xianzhitang@tencent.com> Co-authored-by: yilian49 <43861414+yilian49@users.noreply.github.com> Co-authored-by: DangKai <dangkai4u@outlook.com> Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: ll819214 <18801269230@163.com> Co-authored-by: Li Junwen <lijunwen13@hisilicon.com> Co-authored-by: zixuanzhang226 <zixuanzhang@bytedance.com> Co-authored-by: Hongbo Xu <1320612015@qq.com> Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> Co-authored-by: Meng, Peng <pengmeng@tencent.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: tarinkk 
<129432511+tarinkk@users.noreply.github.com> Co-authored-by: tarinkk <rt572@physics.rutger.edu> Co-authored-by: tarinkk <rt572@rutgers.physics.edu> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com> Co-authored-by: Sheng Qi <shengqi2018@pku.edu.cn> Co-authored-by: finetune <82650881+finetunej@users.noreply.github.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: Kan Wu <wukanustc@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: narutolhy <582909902@qq.com> Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Shenggui Li <somerlee.9@gmail.com> Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: Simon_CQK <cqk0100@gmail.com> Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: yych0745 <1398089567@qq.com> Co-authored-by: HandH1998 <1335248067@qq.com> Co-authored-by: 弋云 <yiyun.wyt@antgroup.com> Co-authored-by: walker-ai <2398833647@qq.com> Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com> Co-authored-by: Albert <albert.zty@antgroup.com> Co-authored-by: Ziming Huang <1520787127@qq.com> Co-authored-by: ayrnb <70835312+ayrnb@users.noreply.github.com> Co-authored-by: HydraQYH <QYH820@Outlook.com> Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: Maksim <makcum888e@mail.ru> Co-authored-by: VDV1985 <vladdv85@mail.ru> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: TianyuZhang1214 <tianyuzhang1214@163.com> Co-authored-by: alpha-baby <fujianhao1997@qq.com> Co-authored-by: Yuchen Cheng <rudeigerc@gmail.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: Caproni <40862361+Capronir@users.noreply.github.com> 
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com> Co-authored-by: 纬杭 <ximing.wxm@antgroup.com> Co-authored-by: zyksir <zyksir@outlook.com> Co-authored-by: SijiaYang <yangsijia.614@bytedance.com> Co-authored-by: yicwang <yichen.wang@bytedance.com> Co-authored-by: Leng Yue <lengyue@lengyue.me> Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com> Co-authored-by: Gang Chen <13298548+MoonBall@users.noreply.github.com> Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com> Co-authored-by: jay <jthakur@habana.ai>

u4lr451 requested review from Ying1123, merrymercy, rkooo567, kssteven418, hnyls2002, zhyncs, ispobock, ByronHsu, xiezhq-hermann, HaiShaw and ch-wan as code owners May 7, 2025 08:40

zhyncs assigned merrymercy and ispobock May 7, 2025

u4lr451 force-pushed the feature_mtp_support_dp_attention branch from 7b9e06d to 9441246 Compare May 7, 2025 17:50

u4lr451 changed the title ~~feat: mtp support dp-attention (#6080)~~ feat: mtp support dp-attention May 7, 2025

ch-wan linked an issue May 7, 2025 that may be closed by this pull request

[Feature] mtp support dp-attention #6080

Closed

2 tasks

zhyncs assigned fzyzcjy and ch-wan May 8, 2025

DeepTecher reviewed May 8, 2025

python/sglang/srt/models/deepseek_nextn.py (outdated, resolved)

DeepTecher reviewed May 9, 2025

python/sglang/srt/speculative/eagle_utils.py (outdated, resolved)

u4lr451 force-pushed the feature_mtp_support_dp_attention branch 3 times, most recently from adcb787 to b61e3a6 Compare May 12, 2025 15:55

austindeng added 5 commits June 15, 2025 18:58

fix refactor bug

c07ba77

Merge remote-tracking branch 'github/main' into u4lr451:feature_mtp_support_dp_attention

ff07187

fix enable_dp_lm_head when dp-size == tp-size

97f531b

Performance: Support enabling CUDA graph when idle batches exist

3f686b1

Merge remote-tracking branch 'github/main' into u4lrssh.feature_mtp_support_dp_attention

5ae3c3d

MiterV1 mentioned this pull request Jun 16, 2025

[Bug] [__recompiles] - 1/0: topk_index._base.size()[0] == topk_index.size()[0] # (unknown var s1, please file a bug) #7230

Closed

5 tasks

Atream mentioned this pull request Jun 16, 2025

[PD] Transfer hidden states for mtp when disaggregation #7242

Merged

6 tasks

austindeng and others added 9 commits June 16, 2025 21:51

Merge remote-tracking branch 'github/main' into u4lr451:feature_mtp_support_dp_attention

f3854ee

refine code for dp lm head

841defa

Merge branch 'main' into feature_mtp_support_dp_attention

2f64ad7

Revert "Performance: Support enabling CUDA graph when idle batches exist"

a279680

This reverts commit 3f686b1.

add a note

038ca0f

Merge commit '873ae12cee348dcb579a4c7456d789ef4441f3bf' into pr/u4lr451/6081

3bc16e4

Merge branch 'main' into feature_mtp_support_dp_attention

16f8a63

fix merge error

e4bf571

clean code and add comments

3a5b9d5

ch-wan approved these changes Jun 17, 2025

Merge branch 'main' into feature_mtp_support_dp_attention

a2effc0

zhyncs merged commit 10d60cd into sgl-project:main Jun 17, 2025
49 of 52 checks passed

u4lr451 mentioned this pull request Jun 17, 2025

Perormance: Enable cuda graph for dp idle batch #7269

Merged

6 tasks

coco-alen pushed a commit to jinleic/sglang that referenced this pull request Jun 20, 2025

fix: Adjust the init_cuda_graph_state and fixbug (sgl-project#6081)

4225684

coco-alen pushed a commit to jinleic/sglang that referenced this pull request Jun 20, 2025

Performance: Eliminate performance impact in non-dp-attention+mtp scenarios (sgl-project#6081)

a1c0a4b

coco-alen pushed a commit to jinleic/sglang that referenced this pull request Jun 20, 2025

Added test cases for dp-attention + mtp (sgl-project#6081)

1aaea3a

coco-alen pushed a commit to jinleic/sglang that referenced this pull request Jun 20, 2025

compatibility for fa3 (sgl-project#6081)

80f5817

freeliuzc mentioned this pull request Jul 1, 2025

[Usage] Did MTP suppport disaggregation with radixcache(prefix cache) enabled at the same time? #7694

Open

u4lr451 commented May 7, 2025

lambert0312 commented May 8, 2025

u4lr451 commented May 8, 2025

lambert0312 commented May 8, 2025

u4lr451 commented May 9, 2025

u4lr451 commented May 12, 2025

zhangxiaolei123456 commented May 13, 2025

MiterV1 commented Jun 16, 2025

ch-wan left a comment

Xuweijia-buaa commented Jun 17, 2025

ch-wan commented Jul 8, 2025
