[PD] Add NIXL transfer backend #5477

trevor-m · 2025-04-16T18:29:50Z

Motivation

This PR adds NIXL as a disaggregation transfer backend via --disaggregation-transfer-backend nixl.

This PR replaces my previous draft, #5006.

Installing NIXL

The easiest way to use NIXL is to start from an NVIDIA PyTorch container, which already has dependencies such as UCX installed. Otherwise, follow the instructions in the NIXL README to install those dependencies.

# Inside docker image nvcr.io/nvidia/pytorch:25.03-py3
git clone https://github.com/ai-dynamo/nixl.git
cd nixl
pip install . --config-settings=setup-args="-Ducx_path=/opt/hpcx/ucx"

Example usage:

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --port 30001 --disaggregation-transfer-backend nixl

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30002 --base-gpu-id 1 --disaggregation-transfer-backend nixl

python3 -m sglang.srt.disaggregation.mini_lb --prefill http://0.0.0.0:30001/ --decode http://0.0.0.0:30002/ --host 0.0.0.0 --port 30000

curl -X POST http://127.0.0.1:30000/generate -H "Content-Type: application/json" -d '{
  "text": "Let me tell you a lonnng story ",
  "sampling_params": {
    "temperature": 0
  }
}'
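
The same request can also be sent from Python. This is a minimal sketch using only the standard library, assuming the mini load balancer from the commands above is listening on 127.0.0.1:30000. The helper names `build_generate_request` and `generate` are illustrative and are not part of SGLang.

```python
import json
import urllib.request

# Address of the mini_lb process started in the example above.
LB_URL = "http://127.0.0.1:30000/generate"


def build_generate_request(text, temperature=0.0, url=LB_URL):
    """Build (but do not send) the POST request equivalent to the curl call."""
    payload = {"text": text, "sampling_params": {"temperature": temperature}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(text, temperature=0.0, timeout=600):
    """Send the request and return the decoded JSON response."""
    req = build_generate_request(text, temperature)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With all three processes running, `generate("Let me tell you a lonnng story ")` should return the same JSON body as the curl command.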

Example usage with tp-size 2:

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --port 30001 --disaggregation-transfer-backend nixl --tp-size 2

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30002 --base-gpu-id 2 --disaggregation-transfer-backend nixl --tp-size 2

python3 -m sglang.srt.disaggregation.mini_lb --prefill http://0.0.0.0:30001/ --decode http://0.0.0.0:30002/ --host 0.0.0.0 --port 30000

curl -X POST http://127.0.0.1:30000/generate -H "Content-Type: application/json" -d '{
  "text": "Let me tell you a lonnng story ",
  "sampling_params": {
    "temperature": 0
  }
}'

Modifications

I reused the MooncakeKVBootstrapServer with some modifications to allow the prefill and decode servers to find each other.
Unlike the Mooncake transfer engine, there is no multithreading for the sender or receiver, since the transfer is done asynchronously. We also don't need sockets to notify the receiver when the transfer is done, since the receiver can check the status through NIXL notifications.
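
The polling pattern described above can be sketched as follows. This is a toy model only: `FakeAgent`, `post_write`, `drain_notifications`, and `ReceiverManager` are hypothetical names invented for illustration and are not the NIXL API or the code in this PR. The point is that the sender posts a write and returns, while the receiver learns about completion by draining notifications when it polls, with no extra thread or socket.

```python
from collections import deque


class FakeAgent:
    """Stand-in for an asynchronous transfer agent (NOT the NIXL API)."""

    def __init__(self):
        self._notifications = deque()

    def post_write(self, room, payload, dest):
        # A real backend would return immediately and move the data in the
        # background; here the copy is instantaneous.
        dest[room] = payload
        # Completion is signalled to the receiver as a notification.
        self._notifications.append(room)

    def drain_notifications(self):
        done = list(self._notifications)
        self._notifications.clear()
        return done


class ReceiverManager:
    """Receiver side: remembers which rooms have completed."""

    def __init__(self, agent):
        self.agent = agent
        self.completed = set()

    def poll(self, room):
        # Called from the scheduler loop; no dedicated thread is needed.
        self.completed.update(self.agent.drain_notifications())
        return room in self.completed
```

Because completed rooms are kept in a set, a notification drained while polling for one room is not lost for another.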

Chunked prefill is supported.

Benchmark

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 50 --random-input 8192 --random-output 512 --random-range-ratio 0.5 --port 30000 --model meta-llama/Llama-3.1-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 1.0

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    1.0
Max reqeuest concurrency:                not set
Successful requests:                     50
Benchmark duration (s):                  107.70
Total input tokens:                      304483
Total generated tokens:                  19429
Total generated tokens (retokenized):    19429
Request throughput (req/s):              0.46
Input token throughput (tok/s):          2827.22
Output token throughput (tok/s):         180.40
Total token throughput (tok/s):          3007.62
Concurrency:                             18.39
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   39603.46
Median E2E Latency (ms):                 41336.43
---------------Time to First Token----------------
Mean TTFT (ms):                          37582.10
Median TTFT (ms):                        39873.36
P99 TTFT (ms):                           68202.43
---------------Inter-Token Latency----------------
Mean ITL (ms):                           5.22
Median ITL (ms):                         5.16
P95 ITL (ms):                            5.77
P99 ITL (ms):                            5.99
Max ITL (ms):                            67.70
==================================================
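
The derived rows in the table can be cross-checked from the raw counts. The sketch below recomputes them from the reported values, assuming throughput is tokens divided by benchmark duration and concurrency is the summed end-to-end latency divided by duration (an assumption about how the tool defines it). Small differences from the table are expected because the duration shown is rounded to two decimals.

```python
# Values copied from the benchmark output above.
duration_s = 107.70
num_requests = 50
total_input_tokens = 304483
total_output_tokens = 19429
mean_e2e_latency_ms = 39603.46

request_throughput = num_requests / duration_s
input_throughput = total_input_tokens / duration_s
output_throughput = total_output_tokens / duration_s
total_throughput = input_throughput + output_throughput

# Sum of per-request latencies (mean * count), in seconds, over wall-clock time.
concurrency = num_requests * mean_e2e_latency_ms / 1000.0 / duration_s

for name, value in [
    ("Request throughput (req/s)", request_throughput),
    ("Input token throughput (tok/s)", input_throughput),
    ("Output token throughput (tok/s)", output_throughput),
    ("Total token throughput (tok/s)", total_throughput),
    ("Concurrency", concurrency),
]:
    print(f"{name}: {value:.2f}")
```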

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

trevor-m · 2025-04-16T18:31:03Z

@ByronHsu @hnyls2002 @jokerwyt Could you please take a look?

jokerwyt · 2025-04-17T03:31:32Z

python/sglang/srt/disaggregation/nixl/conn.py

+        self.started_transfer = False
+
+        # NOTE: key distinguished by bootstrap_addr and engine_rank
+        bootstrap_key = f"{self.bootstrap_addr}_{self.kv_mgr.kv_args.engine_rank}"


Hi, engine_rank is shared across this TP worker. Why do we need to put it into bootstrap_key to distinguish different remote prefill TP workers?

Thank you @jokerwyt for the review!

Each TP worker has a separate process, so the engine rank is used to distinguish between those.
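
As a small illustration of that answer: the key format below mirrors the f-string in the diff, while the address, the registry dict, and the `agent_name` values are made up for the example. Two TP worker processes that share one bootstrap address end up under distinct keys.

```python
def make_bootstrap_key(bootstrap_addr: str, engine_rank: int) -> str:
    # Same shape as the key built in the diff above.
    return f"{bootstrap_addr}_{engine_rank}"


# Hypothetical registry: one entry per TP worker process.
registry = {}
for rank in range(2):
    key = make_bootstrap_key("10.0.0.1:8998", rank)
    registry[key] = {"agent_name": f"prefill-tp{rank}"}
```

Without the rank suffix, the second worker would overwrite the first worker's entry.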

jokerwyt · 2025-04-17T03:59:35Z

python/sglang/srt/disaggregation/nixl/conn.py

+        self.port = port
+        self.app = web.Application()
+        self.store = dict()
+        self.lock = asyncio.Lock()


Hi, why do we need this lock since the bootstrap server will run in a single thread?

This is needed since the bootstrap server is using Python coroutines.
The bootstrap server code is the same as mooncake's:

sglang/python/sglang/srt/disaggregation/mooncake/conn.py

Line 479 in 06d0a3d

class MooncakeKVBootstrapServer(BaseKVBootstrapServer):
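
To illustrate why a lock can matter even on a single thread: coroutines can interleave at every `await`, so a read-modify-write that spans an `await` can lose updates. The sketch below is a generic asyncio example with a hypothetical `Store` class, not the bootstrap server code. Whether the real handlers need the lock depends on whether they await while touching shared state.

```python
import asyncio


class Store:
    def __init__(self):
        self.data = {}
        self.lock = asyncio.Lock()

    async def register_unsafe(self, key, value):
        current = self.data.get(key, [])
        await asyncio.sleep(0)  # any await lets another coroutine run here
        self.data[key] = current + [value]

    async def register_safe(self, key, value):
        async with self.lock:
            current = self.data.get(key, [])
            await asyncio.sleep(0)
            self.data[key] = current + [value]


async def run(method_name, n=5):
    store = Store()
    method = getattr(store, method_name)
    await asyncio.gather(*(method("room", i) for i in range(n)))
    return store.data["room"]


unsafe = asyncio.run(run("register_unsafe"))
safe = asyncio.run(run("register_safe"))
print(len(unsafe), len(safe))
```

The unlocked version loses registrations because every coroutine reads the list before any of them writes it back; the locked version keeps all of them.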

jokerwyt · 2025-04-17T04:09:25Z

python/sglang/srt/disaggregation/nixl/conn.py

+            self.transfer_infos[bootstrap_room] = TransferInfo.from_zmq(
+                waiting_req_bytes
+            )
+            assert bootstrap_room == self.transfer_infos[bootstrap_room].room


How do we make sure the incoming TransferInfo is for the request we want to transfer?
Does this only work with max_running_request=1 right now?

Hmm, good point. I've been using the default max_running_request and didn't run into any issues, but that may just be by chance. Let me fix this.

Just pushed a fix for this, thanks.
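
The fix itself is not shown in this thread. One way to handle the concern, sketched here under assumptions, is to store incoming messages by room and let each request look up only its own entry, instead of assuming that the next message belongs to the current request. `TransferInfoTable` and the `dst_kv_indices` field are hypothetical; only the `room` attribute appears in the diff.

```python
from dataclasses import dataclass, field


@dataclass
class TransferInfo:
    room: int
    dst_kv_indices: list = field(default_factory=list)  # hypothetical field


class TransferInfoTable:
    """Routes incoming TransferInfo messages to the request that owns them."""

    def __init__(self):
        self._by_room = {}

    def on_message(self, info: TransferInfo) -> None:
        # Messages may arrive in any order when several requests are running.
        self._by_room[info.room] = info

    def get(self, room: int):
        # Returns None until the message for this room has arrived.
        return self._by_room.get(room)
```

A sender for room 1 then keeps polling `get(1)` and never consumes the message meant for room 2.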

jokerwyt · 2025-04-18T02:43:21Z

Cool. I think we are ready to merge now! @ByronHsu @hnyls2002

jokerwyt · 2025-04-18T12:21:07Z

python/sglang/srt/disaggregation/nixl/conn.py

+            # Cleanup
+            del self.num_kvs_expected[room]
+            del self.received_kvs[room]
+            del self.received_aux[room]


I found a concurrency bug here during testing.
We have many TP workers, and they call check_transfer_done separately. If we do the clean up here, requests will have no chance to get KVPoll.Success from all TP workers.
We may need a flag in KVReceiver that records whether check_transfer_done returned True in a previous poll.

Thanks @jokerwyt, let me fix that.

Hi @jokerwyt I removed the cleanup for now to fix the bug.
I think we can add proper cleanup later (it is also needed for KVSender), but that might require more hooks in decode.py and prefill.py.

Okay, it can be a temporary fix for testing.

decode.py and prefill.py manage the lifecycle of KVReceiver and KVSender. Can we put the related flags and data structures in them so that cleanup can happen naturally?
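
A rough sketch of that suggestion, with hypothetical class and method names (`TransferTracker`, `Receiver`, `close`) rather than the actual SGLang classes: the receiver caches a success flag so that repeated polls keep returning success, and the per-room bookkeeping is dropped only when the owner of the receiver releases it.

```python
class TransferTracker:
    """Per-process bookkeeping (hypothetical stand-in for the KV manager)."""

    def __init__(self):
        self.num_kvs_expected = {}
        self.received_kvs = {}

    def check_transfer_done(self, room):
        if room not in self.num_kvs_expected:
            return False
        return self.received_kvs.get(room, 0) >= self.num_kvs_expected[room]

    def release(self, room):
        self.num_kvs_expected.pop(room, None)
        self.received_kvs.pop(room, None)


class Receiver:
    def __init__(self, tracker, room):
        self.tracker = tracker
        self.room = room
        self.succeeded = False  # remembers a previous successful poll

    def poll(self):
        if not self.succeeded and self.tracker.check_transfer_done(self.room):
            self.succeeded = True
        return self.succeeded

    def close(self):
        # Called by whoever owns the receiver once the request is finished.
        self.tracker.release(self.room)
```

Polling stays side-effect free with respect to the shared dictionaries, so a later poll cannot miss a success that an earlier poll already observed.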

msharmavikram · 2025-04-21T17:03:45Z

@trevor-m nixl supports pip install nixl (https://pypi.org/project/nixl/)

ayrnb · 2025-05-12T11:46:47Z

@trevor-m Does the NIXL transfer backend support DeepEP?

CSEEduanyu · 2025-05-20T13:33:59Z

For cross-machine transfers, does the current scheme use GPUDirect or NVSHMEM? @trevor-m

msharmavikram · 2025-05-20T14:07:46Z

As of today, we use whatever is supported in UCX.

(So we are not using NVSHMEM currently, but transfers still go through NVLink. If you are interested in GPU-initiated transfers, can you provide details on the use case?)

CSEEduanyu · 2025-05-20T14:24:28Z

> As of today, we use whatever is supported in UCX.
>
> (So we are not using NVSHMEM currently, but transfers still go through NVLink. If you are interested in GPU-initiated transfers, can you provide details on the use case?)

So if the KV cache is transferred between two different machines, does it need to go through CPU memory as an intermediary? Will performance be relatively poor in that case? @msharmavikram

msharmavikram · 2025-05-23T21:31:39Z

It should not. NIXL picks the fastest path to transfer between the worker instances.
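
For anyone who wants to check which path is actually chosen on their own machines, here is a minimal sketch, not taken from this PR. It assumes NIXL's UCX backend honors the standard UCX environment variables (`UCX_LOG_LEVEL`, `UCX_TLS`). The helper name `nixl_launch` is hypothetical; the sglang flags are the ones from the PR description.

```python
import os


def nixl_launch(mode, port, ucx_env=None, extra_args=()):
    """Build (argv, env) for an sglang server using the NIXL transfer backend.

    Hypothetical helper for illustration only. `ucx_env` holds standard UCX
    environment variables; whether NIXL forwards them unchanged to UCX is an
    assumption here.
    """
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Llama-3.1-8B-Instruct",
        "--disaggregation-mode", mode,
        "--port", str(port),
        "--disaggregation-transfer-backend", "nixl",
        *extra_args,
    ]
    # Copy the current environment and layer the UCX settings on top.
    env = dict(os.environ)
    env.update(ucx_env or {})
    return argv, env


# Raise UCX's log level so it is more likely to report the transports it uses.
argv, env = nixl_launch("prefill", 30001, ucx_env={"UCX_LOG_LEVEL": "info"})

# To actually start the server (requires sglang and NIXL to be installed):
#   import subprocess
#   subprocess.Popen(argv, env=env)
```

Passing something like `ucx_env={"UCX_TLS": "rc,cuda_copy,cuda_ipc"}` would restrict UCX to the listed transports, which can help when comparing paths. That transport list is only illustrative and depends on the hardware and the UCX build.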

trevor-m requested review from merrymercy, Ying1123, hnyls2002, zhyncs, ispobock and ByronHsu as code owners April 16, 2025 18:29

trevor-m mentioned this pull request Apr 16, 2025

Draft: [PD] NIXL Integration #5006

Closed

6 tasks

trevor-m force-pushed the nixl-2 branch from 610c93b to e50798d Compare April 16, 2025 21:10

trevor-m mentioned this pull request Apr 16, 2025

[Roadmap] Prefill and Decoding Disaggregation #4655

Open

13 tasks

jokerwyt reviewed Apr 17, 2025

trevor-m force-pushed the nixl-2 branch from 7b793c3 to 69b3af4 Compare April 17, 2025 21:49

jokerwyt reviewed Apr 18, 2025

trevor-m added 7 commits April 18, 2025 20:28

Add NIXL disaggregation transfer backend

6d77a6e

use check_remote_xfer_status. hangs

8736437

Working

422c134

Cleanup

af6bb9f

Support chunked prefill

e7d474a

Fix concurrent request notif handling

45c59bf

Fix concurrent requests on sender side

983bdb5

trevor-m force-pushed the nixl-2 branch from 69b3af4 to a7fa5f5 Compare April 18, 2025 20:35

Remove cleanup from get_transfer_status(). Use TransferStatus class

5a9b10a

trevor-m force-pushed the nixl-2 branch from a7fa5f5 to 5a9b10a Compare April 19, 2025 00:19

ByronHsu approved these changes Apr 20, 2025

Merge branch 'main' into nixl-2

d8a22f2

merrymercy added the high priority label Apr 21, 2025

hnyls2002 merged commit 4dce1cc into sgl-project:main Apr 21, 2025
18 of 23 checks passed

RunkaiTao pushed a commit to Pb314314/sglang that referenced this pull request Apr 23, 2025

[PD] Add NIXL transfer backend (sgl-project#5477)

7ad6cf7

jokerwyt mentioned this pull request Apr 24, 2025

[PD] NIXL backend Prefill TP & Decode TP+DP #5681

Merged

6 tasks
