Releases: sgl-project/sglang
Release v0.5.1
What's Changed
- [PD] Use batch transfer for rdma transport and add notes for mnnvl usage by @ShangmingCai in #8595
- [bugfix] QWen-1M context support[2/3] using current cuda stream in the DCA's kernel for bugfix. by @sighingnow in #8611
- Fix hf3fs_fuse import error by @ispobock in #8623
- Update step3v default config by @ispobock in #8626
- [ci] fix genai-bench execution cmd by @slin1237 in #8629
- [router] update router pypi version by @slin1237 in #8628
- [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x by @b8zhong in #8577
- Fix typos in py_test/test_launch_server.py by @windsonsea in #6227
- misc: Remove debug print to logger.info by @CatherineSue in #8633
- SGLang HiCache NIXL Connector by @vvenkates27 in #8488
- [bug] remove pdlb from minilb since it's no longer available by @slin1237 in #8634
- [bugfix] Fix flashinfer cutlass EP moe after MoE refactor by @trevor-m in #8630
- Conditionally import HiCacheHF3FS by @pansicheng in #8598
- TRTLLM Gen MLA Decode Kernel Integration (same as #7938) by @farazkh80 in #8632
- Fix nan value generated after custom all reduce by @kkHuang-amd in #8532
- Revert "Fix nan value generated after custom all reduce (#8532)" by @zhyncs in #8642
- Feature/modelscope model download by @yrk111222 in #8083
- chore: speedup NPU CI by cache by @pkking in #8270
- [Bugfix] fix w8a8_int8 load issue by @iforgetmyname in #8308
- [bugfix] fix router python parser for pd urls by @slin1237 in #8644
- [router] add basic usage doc by @slin1237 in #8640
- [router] upgrade router version to 0.1.8 by @slin1237 in #8645
- [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE by @kaixih in #8450
- HiCache, fixing hash value indexing by @xiezhq-hermann in #8636
- Interface change for kvcache io to support page first layout by @xiezhq-hermann in #8318
- Update batch size limitation of dsv3_router_gemm kernel to 16 by @Fridge003 in #8051
- chore: bump v0.4.10.post1 by @ispobock in #8652
- Add hf3fs_utils.cpp to package-data by @pansicheng in #8653
- Fix chat template handling for OpenAI serving by @JustinTong0323 in #8635
- Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… by @byjiang1996 in #8511
- [5/N] MoE Refactor: Update MoE parallelism arguments by @ch-wan in #8658
- Increase tolerance to address CI failures by @lifuhuang in #8643
- [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 by @panpan0000 in #8013
- [Doc] fix: Update README for cu126 sgl-kernel compile problem by @Hongbosherlock in #8665
- fix per token cuda kernel hidden dim cannot divide by 16 by @hebiao064 in #8543
- fix arg typo for --disaggregation-transfer-backend by @ZacWang in #8664
- [fix] fix pd disagg error of vlms by @ccw1996 in #8094
- Disable tp for shared experts under expert parallelism for GLM4.5 model (#8647) by @zminglei in #8647
- [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla by @trevor-m in #8685
- [bug] limit bootstrap room to [0, 2^63 - 1] by @slin1237 in #8684
- Update CODEOWNERS by @merrymercy in #8686
- Fix deepgemm masked grouped gemm jit compile by @ispobock in #8679
- Fix FP8 block quantization when N or K is not multiples of 128 by @yanbing-j in #8648
- bugfix(hicache): Fix 'MooncakeStore' not defined error. by @hzh0425 in #8668
- upgrade xgrammar 0.1.22 by @Swipe4057 in #8522
- [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually by @lbh2001 in #8618
- Add support for NCCL symmetric memory for TP allreduces by @nvcastet in #8238
- [1/2] sgl-kernel: Fuse routed scaling factor into select_experts by @trevor-m in #8364
- chore(gb200): update dockerfile to handle fp4 disaggregation by @ishandhanani in #8694
- [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 by @trevor-m in #8688
- Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled by @GaoYusong in #7434
- model: adapt mllama4 to VisionAttention by @wenchen76 in #8512
- Add tensor.detach() back to update weight util by @hebiao064 in #8691
- [Doc] Polish sgl-kernel readme for cu126 build error by @FlamingoPg in #8704
- Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" by @hnyls2002 in #8706
- [router] minor code cleanup and refactoring by @slin1237 in #8711
- [Bug] fix green context's incompatibility with cuda < 12.4 by @hnyls2002 in #8701
- chore: bump sgl-kernel v0.2.9 by @zhyncs in #8713
- Remove assertions about per group quant fp8 by @fzyzcjy in #8717
- [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 by @merrymercy in #8693
- Fix triton moe error caused by TopK refactor by @fzyzcjy in #8705
- [router] Implement HTTP Dependency Injection Pattern for Router System by @slin1237 in #8714
- [Feature] Radix Tree in C++ by @DarkSharpness in #7369
- [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm by @HydraQYH in #8722
- Fix fused MoE when routed_scaling_factor is None by @hnyls2002 in #8709
- Tiny fix CI pytest error by @fzyzcjy in #8524
- [hotfix] fix mixtral with tensor-level compressed-tensor quantization by @ch-wan in #8721
- Support limiting max loaded loras in CPU. by @lifuhuang in #8650
- Reduce memory accumulation in long-running server by @Edenzzzz in #8306
- HiCache storage, style change and bug fix by @xiezhq-hermann in #8719
- [feat] support minimum token load balance in dp attention by @WANG-GH in #7379
- Do layernorm before allgather for DP attention by @trevor-m in #8631
- [fix] Fix divide by zero error for llama4. by @shenoyvvarun in #8683
- feat: Add new moe triton for NVIDIA RTX 6000 Ada by @17Reset in #8547
- [Improvements] Merge health check route by @whybeyoung in #8444
- chore: bump sgl-kernel 0.3.0 with torch 2.8.0 by @zhyncs in #8718
- Save cuda graph memory for fa3 by @ch-wan in #8567
- [CUDA Graph] save cuda graph memory by using next_token_logits_buffer by @ch-wan in #8579
- [DP] fix the compatibility issue between DP attention and --attention-backend triton by @ch-wan in #8723
- chore: bump v0.4.10.post2 by @zhyncs in #8727
- feat: Support DP Attention for step3_vl by @yhyang201 in #8699
- [RL] fix update weight for FusedMoE with EP by @zhuzilin in #8676
- use fp32 for e_score_correction_bias in GLM-4.5 by @zRzRzRzRzRzRzR in #8729
- Fix triton kernels topk with keyword arguments by @ispobock in https://github.com/sgl-project/sglang/pull/...
Release v0.4.10
Highlights
This is a regular release with many new optimizations, features, and fixes. Please check out the following exciting roadmaps and blog posts:
- Please check the 2025 H2 roadmap #7736
- GLM-4.5 Meets SGLang: Reasoning, Coding, and Agentic Abilities https://lmsys.org/blog/2025-07-31-glm4-5/
- SpecForge: Accelerating Speculative Decoding Training for SGLang https://lmsys.org/blog/2025-07-25-spec-forge/
- Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/
- Accelerating SGLang with Multiple Token Prediction https://lmsys.org/blog/2025-07-17-mtp/
- How to support new VLMs into SGLang: A Case Study with NVILA https://lmsys.org/blog/2025-07-16-nvila/
- Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/
- slime: An SGLang-Native Post-Training Framework for RL Scaling https://lmsys.org/blog/2025-07-09-slime/
What's Changed
- [AMD] add aiter fused moe in DeepEP path by @alexsun07 in #7268
- enable aiter_biased_grouped_topk kernel by @valarLip in #7423
- [PD Disaggregation] replace transfer with batch transfer for better performance by @ssssnow in #7236
- Remove cumsum_buffer initialization by @ispobock in #7439
- [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm by @BBuf in #7422
- Support multi-thread model weight loading by @xianzhiT in #7277
- [PD] NIXL: Register kv args in advance and cleanup finished requests by @trevor-m in #6717
- fix: Add --model as an alias for --model-path in server_args by @CatherineSue in #7505
- misc: Improvement to serving_chat.py and add more ut by @CatherineSue in #7489
- Fuse sorted_token_ids padding to moe_align_block_size kernel by @ispobock in #7437
- [OAI] patch origin request_id logic by @whybeyoung in #7508
- [PD][Spec] Fix hidden state transfer for spec decode by @ShangmingCai in #7516
- EPLB support for MTP by @yilian49 in #7510
- clean duplicate code by @habaohaba in #7512
- [ci] add router benchmark script and CI by @slin1237 in #7498
- fix: force synchronization between TP workers when update_weights by @dangkai4u in #6626
- [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model by @chunyuan-w in #6641
- [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug by @ShangmingCai in #7522
- npu fused op by @ll819214 in #7386
- feat: send kvmetrics from sglang scheduler by @zixuanzhang226 in #6721
- [PD] Add different TP sizes support for no-MLA models by @Hongbosherlock in #6793
- enable aiter fp8 blockscale quant by @valarLip in #7520
- take aiter get_rope back by @valarLip in #7521
- Fix typo of flash_cache by @hebiao064 in #7513
- feat: add return hidden_states at async generation by @yyihuang in #7507
- minor: 'role' must be system/assistant/tool, but case insensitive for now by @minleminzui in #7499
- Fix FP8 KV Cache Support in FA3 Backend by @guoyuhong in #7148
- Fix gathered_buffer issues in tbo by @Qiaolin-Yu in #7531
- [PD] Raise error for incompatible mooncake version and some minor fixes by @ShangmingCai in #7527
- [CMake] Fix sgl-kernel CMakeLists for Blackwell by @MasterJH5574 in #7543
- Add Tencent HunYuanMoEV1 model support by @mpjlu in #7549
- Update seed in CPU UTs to avoid flaky failure with single test by @yanbing-j in #7544
- chore: improve ci bug reporting by @mickqian in #7542
- chore: remove vlm unnecessary import by @JustinTong0323 in #7541
- chore: bump v0.4.8.post1 by @zhyncs in #7559
- [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND by @trevor-m in #7330
- [Fix] incorrect assert in EPLB by @ch-wan in #7575
- Updates Gemma3n MLP layer to adapt latest transformers version by @JustinTong0323 in #7573
- Fix MTP error when enabling two-batch overlap by @fzyzcjy in #7569
- Add e2e test for multi instance multi stage memory release/resume occupation by @MrAta in #7208
- [CI] Add CI Testing for Prefill-Decode Disaggregation with Router by @key4ng in #7540
- Updates transformers and timm dependencies by @JustinTong0323 in #7577
- feat: support compatibility between MTP and two-batch-overlap by @Qiaolin-Yu in #7225
- Move multimodal processors into a separate folder by @merrymercy in #7581
- Fix broken CI TestVILAServer by @lifuhuang in #7610
- [router] add centralized configuration module for sgl-router by @slin1237 in #7588
- Fix: Minicpm by @JustinTong0323 in #7612
- Hybrid kv cache for LLaMA4 by @tarinkk in #6563
- [CPU] add optimizations for INT8 and FP8 DeepSeek by @chunyuan-w in #6769
- Tiny add logs for expert location updater by @fzyzcjy in #7308
- Fix flakiness in LoRA batch test. by @lifuhuang in #7552
- [BUG] fix local_rank in initialize_dp_attention by @TomQuartz in #7584
- Support dynamic LoRA loading / unloading in engine/server API by @lifuhuang in #7446
- [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated by @ShangmingCai in #7598
- fix unit tests by @zhyncs in #7618
- Let ep_scatter support arbitrary strides / ue8m0 format by @fzyzcjy in #7309
- Let EP prefill support new DeepGEMM by @fzyzcjy in #7310
- docs: add gb200 nvl72 and a16z grant by @zhyncs in #7620
- Adds support for OpenAI chat completions API in bench_serving by @JustinTong0323 in #7036
- [bugfix] Remove PR comment posting from Rust benchmark workflow by @slin1237 in #7625
- [Minor] clean up multimodal processor and tokenizer manager by @merrymercy in #7624
- Add dsv3 fused a gemm to sgl-kernel by @ispobock in #7630
- Add @mickqian as the CODEOWNERS of multimodal by @merrymercy in #7636
- Fix stream reasoning parser and Adds Kimi reasoning parser by @JustinTong0323 in #7432
- Fix sgl-router startup crash by @finetunej in #7619
- [bugfix] fix runtime dropping panic in editable by @slin1237 in #7628
- Move files related to EPLB by @fzyzcjy in #7580
- [misc] reduce weird rope_scaling_factor warning by @Alcanderian in #7176
- [AMD] Add unit-test-sgl-kernel-amd to AMD CI by @hubertlu-tw in #7539
- Update CODEOWNERS by @merrymercy in #7640
- [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py by @merrymercy in #7643
- [CPU] add c++ kernel to bind CPU cores and memory node by @chunyuan-w in #7524
- Improve streaming, log_level, memory report, weight loading, and benchmark script by @merrymercy in #7632
- Add dsv3 router gemm kernel by @Fridge003 in #7627
- chore: upgrade flashinfer v0.2.7 jit by @zhyncs in #7663
- [doc] update lws doc for pd by @whybeyoung in #7318
- Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes by @narutolhy in https://github.com/sgl-project...
Release v0.4.8
Highlights
OpenAI-Compatible Server Refactor
Re-structured the OpenAI-compatible server to support production and enterprise environments. Key improvements include:
- Consistent metrics and logging for better observability and debugging.
- Unified error handling, request validation, and processing logic for improved reliability and maintainability.
- Improved request tracking across sessions and components.
- Fixed bugs in embedding requests and reasoning parsers.
This work was a collaborative effort involving engineers from academic and industry institutions. Special thanks to the Oracle Cloud team and the SGLang team and community — including @slin1237, @CatherineSue, @key4ng, @JustinTong0323, @jhinpan, @yhyang201, @woodx9 and @whybeyoung — for their invaluable contributions.
DeepSeek R1 FP4 on Blackwell GPU
Added support for DeepSeek R1 with FP4 and MTP on NVIDIA Blackwell GPU.
- Integrated FlashInfer NVFP4 MoE, supporting TP, EP, and DP.
- Supported 2-stream shared expert execution.
- Achieved up to 90 TPS per user at isl/osl/bs = 1k/1k/16 on B200.
Further optimization in progress. Special thanks to the FlashInfer, NVIDIA Enterprise Products, Novita AI, DataCrunch, Google Cloud, and SGLang teams — especially @Alcanderian and @pyc96 — for their critical contributions.
Breaking Change: OpenAI-Compatible API Module Moved
The sglang/srt/openai_api directory has been removed and replaced with sglang/srt/entrypoints/openai. Update your imports to the new module path. For example:
- from sglang.srt.openai_api.protocol import Tool
+ from sglang.srt.entrypoints.openai.protocol import Tool
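For code that needs to run on releases both before and after this move, a minimal compatibility shim along the following lines may help; the try/except fallback is an illustration rather than an officially supported pattern, and Tool is simply the class from the example above.

```python
# Minimal sketch of a compatibility shim (not an official API): prefer the new
# module path introduced in this release, fall back to the pre-v0.4.8 layout.
try:
    from sglang.srt.entrypoints.openai.protocol import Tool  # v0.4.8 and later
except ImportError:
    from sglang.srt.openai_api.protocol import Tool  # older releases
```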
What's Changed
- Update README.md by @merrymercy in #7040
- [Docker] Upgrading base image from 24.04 to 24.12 by @Swipe4057 in #7043
- fix 24.12 docker by @zhyncs in #7045
- Minor cleanup of fa3 backend by @merrymercy in #6999
- Fix eagle on AMD by @merrymercy in #7051
- Clean up server_args.py by @merrymercy in #7037
- Minor style fix in cuda_graph_runner.py by @merrymercy in #7053
- [WA] fix output data is nan in CI test "test_moe_eval_accuracy_large.py" by @kkHuang-amd in #7021
- [fix] libmlx5.so already in base image by @HanHan009527 in #7060
- Fix test_lora.py CI by @Fridge003 in #7061
- Tiny fix cutlass_mla_get_workspace_size stub incorrect signature by @fzyzcjy in #7057
- Add sanity checks when a test file is not added to CI by @fzyzcjy in #6947
- Revert "Add sanity checks when a test file is not added to CI (#6947)" by @zhyncs in #7063
- Fix missing tool call id if tool call index >0 in streaming tool call output. by @Xu-Wenqing in #7049
- chore: update dev docker by @zhyncs in #7064
- Open AI API hidden states by @kyle-pena-kuzco in #6716
- fix arm sgl-kernel link issue by @zhyncs in #7066
- [Feature] Add Logit Bias by @b8zhong in #6579
- Improve perf tuning docs by @merrymercy in #7071
- Frontend language separate reasoning support by @binarycrayon in #6031
- Do not run frontend_reasoning.ipynb to reduce the CI load by @merrymercy in #7073
- Simplify the heuristics for setting --mem-fraction-static by @merrymercy in #7054
- update doc by @Ximingwang-09 in #7046
- Clean up docs for server args and sampling parameters (generated by grok) by @merrymercy in #7076
- Fix GGuf and add back test_gguf.py by @Fridge003 in #7067
- vlm: adapt internvl to VisionAttention by @mickqian in #6870
- Fix circular import in test_prefix_chunk_info.py by @Fridge003 in #7097
- Fix misusing the "_is_cuda". by @sogalin in #7091
- Support VILA models by @futrime in #6106
- [FIX]remove redundant code in logits_processor.py by @pc-neo in #7079
- [feat]: Emit fixed-size KV blocks events by @faradawn in #6824
- [Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations by @lifuhuang in #6994
- Fix positional argument by @liquanfeng in #7093
- [sgl-kernel] Add cuda kernel for moe_ep_silu_and_mul by @yuan-luo in #6919
- Improve log status by @hnyls2002 in #7115
- feat: update blackwell setup by @zhyncs in #7119
- Update CODEOWNERS by @merrymercy in #7126
- Add gfx950 support for sgl-kernel. by @sogalin in #7092
- [Fix] Reduce busy polling when scheduler is idle by @p12tic in #6026
- Minor add utility to read expert distribution recorder output by @fzyzcjy in #7134
- Remove unnecessary metadata_expand.max_seq_len_k operations in fa3 to… by @byjiang1996 in #7140
- Minor speedup topk postprocessing by @fzyzcjy in #7058
- filter by num_hidden_layers by @pansicheng in #7056
- Remove 200us slow concat kernel (part 1: kernel) by @fzyzcjy in #7145
- Support new DeepGEMM format in per token group quant by @fzyzcjy in #7146
- chore: bump v0.1.8.post1 by @zhyncs in #7152
- Support new DeepGEMM format in per token group quant (part 2: srt) by @fzyzcjy in #7155
- Fix DeepEP error in some environments by @fzyzcjy in #7154
- Minor speed up block_quant_dequant by @fzyzcjy in #6814
- Tiny add sanity checks for DeepGEMM inputs by @fzyzcjy in #7157
- Remove 200us slow concat kernel (part 2: srt) by @fzyzcjy in #7020
- Re-quantize DeepSeek model weights to support DeepGEMM new input format by @fzyzcjy in #7156
- Minor style change of triton backend by @merrymercy in #7165
- Split the eagle test into two files by @merrymercy in #7170
- Support new DeepGEMM input format in silu_and_mul_masked_post_quant_fwd by @fzyzcjy in #7153
- Refactor DeepGEMM integration by @fzyzcjy in #7150
- Add test for refactored openai server by @jhinpan in #7161
- Improve test cases for eagle infer by @merrymercy in #7173
- Support new DeepGEMM by @fzyzcjy in #7172
- Increase timeout in test/srt/test_disaggregation.py by @merrymercy in #7175
- Add Phi-4-mm to supported VLM model list. by @lifuhuang in #7178
- Fix shared experts fusion + weight requant by @fzyzcjy in #7177
- [fix] fix dsv3 weight loader tqdm and simplify shared experts fusion by @Alcanderian in #7181
- [fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla by @Alcanderian in #7184
- [PD] Update prefill.py by @ByronHsu in #7190
- Fix a minor bug related to DeepGEMM upgrade by @zhijian-liu in #7191
- chore: bump v0.1.8.post2 by @zhyncs in #7189
- [fix] fix determine_num_fused_shared_experts by @Alcanderian in #7180
- chore: upgrade sgl-kernel v0.1.8.post2 by @Alcanderian in #7186
- Fix NCCL 2.27.3 not in docker image by @fzyzcjy in #7195
- Fix error when disabling new DeepGEMM by @fzyzcjy in #7198
- [PD] Support decode retract and update decode.py by @ByronHsu in #7196
- Move host memory pools into a separate file by @merrymercy in #7200
- Lianmin/simplify memory pool by @merrymercy in #7202
- Fix grammar abort & Minor style fixes by @merrymercy in https://github.com/sg...
Release v0.4.7
Highlights
- The PD disaggregation and large-scale EP functionalities from the blog post have now been fully merged into the latest release.
- The blog has been successfully reproduced by over six industry teams, including the TensorRT LLM team.
- SGLang’s large-scale EP is now actively used by leading organizations such as Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek, and more. It has been deployed and validated at large scale, running on GPU clusters with thousands of devices.
- PD disaggregation and large-scale EP, in addition to supporting DeepSeek V3/R1, now also support Qwen 3 in the latest release.
- Full Blackwell support for DeepSeek V3/R1, Llama 4, and Qwen 3. Further optimizations are underway.
- SGLang's DeepSeek V3/R1 now achieves 190 TPS on a single H200, outperforming other frameworks by over 50%.
We extend our sincere thanks to the following contributors, listed in alphabetical order: Alibaba Cloud, AMD Team, Ant Group, Baseten Team, Cursor Team, Dynamo Team, EAGLE Team, FlashInfer Team, Google Vertex AI Team, iFlytek MaaS Team, Intel Team, LinkedIn Team, Meituan Team, Microsoft Copilot Team, Mooncake Team, NVIDIA Team, Oracle Team, Qwen Team, Voltage Park Team and open source community users. Your support and collaboration are deeply appreciated!
What's Changed
- Update nightly-test.yml by @merrymercy in #5797
- [CI] Improve github summary & enable fa3 for more models by @merrymercy in #5796
- [Docs] update grafana setup guide in production metrics by @PopSoda2002 in #5643
- [Misc] add structure logging, write to file and log tracing for SGL R… by @slin1237 in #5741
- Improve overlap scheduling by @hnyls2002 in #5788
- Add Cutlass MLA attention backend by @trevor-m in #5390
- chore: upgrade sgl-kernel 0.1.0 by @zhyncs in #5690
- Dockerfile.dev pip scikit_build_core by @BBuf in #5807
- Add a doc to fix sgl-kernel build link error in py39 with ccache by @BBuf in #5809
- Turn on overlap scheduler for multimodal models by @merrymercy in #5771
- Tiny refactor DefaultModelLoader.Source by @fzyzcjy in #5482
- [Docs] Replace lists with tables for cleanup and readability in server_arguments by @windsonsea in #5276
- Revert "Tiny refactor DefaultModelLoader.Source" by @merrymercy in #5825
- Feat: add support for thinking mode via chat_template_kwargs.enable_t… by @minleminzui in #5551
- fix: fix the error where the content is None when reasoning and tool … by @minleminzui in #5838
- feat: Add fused moe triton config for qwen3 moe on h100 by @JustinTong0323 in #5833
- fused moe triton tuning script support qwen3 by @BBuf in #5842
- feat: Add fused moe triton config for qwen3bf16 moe on h20 by @yhyang201 in #5839
- [PD] support pd fake transfer for warmup by @whybeyoung in #5726
- [qwen3] qwen3moe_tune_h20 fp8 tp4 by @whybeyoung in #5846
- [Doc] Recover history of server_arguments.md by @Fridge003 in #5851
- feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 by @GeLee-Q in #5850
- [CI] test chunked prefill more by @merrymercy in #5798
- ROCm: update AITER by @HaiShaw in #5816
- [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel by @yinfan98 in #5847
- [Fix] Missing bootstrap_port field by @xutianyi1999 in #5823
- feat: update is_fa3_default_architecture by @zhyncs in #5854
- add fused moe config for qwen3moe fp8/bf16 by @yizhang2077 in #5849
- chore: bump v0.4.6.post1 by @zhyncs in #5845
- Support max_completion_tokens for OpenAIChatCompletions by @CatherineSue in #5857
- simplify fused_moe config logging by @BBuf in #5801
- [CI] tune the test order to warmup the server by @merrymercy in #5860
- Cutlass MLA decode - fix dtype error by @trevor-m in #5868
- cutlass 3.9 supported to improve fp8_blockwise_gemm by @BBuf in #5820
- [Feature] support auto chat template by @woodx9 in #4949
- Feat: support cuda graph for LoRA by @Qiaolin-Yu in #4115
- Add qwen3 30b fused moe config by @JustinTong0323 in #5859
- [Fix] Fix a bug for flashmla to run R1 model by @pengcuo in #5875
- Add A800 fused moe config for qwen3 30b by @lambert0312 in #5880
- [Misc] add service discovery for sgl router by @slin1237 in #5865
- [fix]: PyO3 macOS linking and consolidate on tracing for logging by @slin1237 in #5856
- chore: update Dockerfile by @zhyncs in #5894
- [Docs] Update docs for Qwen3 and Qwen3MoE by @adarshxs in #5836
- Tables instead of bulletpoints for sampling doc by @simveit in #5841
- chore: update CODEOWNERS by @zhyncs in #5895
- [FEATURE] Enhance platform compatibility for ARM by @johnnynunez in #5746
- [CI] Add test_function_calling.py to run_suite.py by @CatherineSue in #5896
- Auto set draft model path for MTP by @ispobock in #5793
- [fix] relax mem_fraction_static for h200 by @Alcanderian in #5893
- feat: support pythonic tool call and index in tool call streaming by @CatherineSue in #5725
- [Bugfix]: fix missing queue_time_start for requests from grammar_queue by @CatherineSue in #5696
- Add AMD MI300x Nightly Testing. by @saienduri in #5861
- chore: use torch 2.6 for sgl-kernel build by @zhyncs in #5898
- Fix check_env script by @lambert0312 in #5901
- [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels #134 by @whybeyoung in #5830
- Bump Flashinfer to 0.2.5 by @Fridge003 in #5870
- [Fix] Unload lora in HF_Runner if needed by @Qiaolin-Yu in #5899
- Add A800 fused moe config for qwen3 235b by @lambert0312 in #5900
- Add sm_120 for blackwell by @zhjunqin in #5903
- [Feature] add support kimi vl model by @liwenju0 in #5383
- support vlm benchmark profile by @yizhang2077 in #5905
- [fix] kimi-vl test in test_vision_openai_server.py by @Alcanderian in #5910
- [Misc] use parallel build for cmake in sgl-kernel by @yinfan98 in #5919
- [qwen3] support qwen3 ep moe by @laixinn in #5917
- Add TP2 MOE benchmarks for AMD. by @saienduri in #5909
- [Feat] Scale up fa3 kernel to sm8x arch by @yinfan98 in #5912
- chore: bump sgl-kernel 0.1.1 by @zhyncs in #5932
- chore: upgrade sgl-kernel 0.1.1 by @zhyncs in #5933
- Remove unused method calculate_num_image_tokens from qwen2_vl.py by @JustinTong0323 in #5783
- [PP] Add pipeline parallelism by @Ying1123 in #5724
- Fix lora batch processing when input lora_path contains None by @Qiaolin-Yu in #5930
- add Thor & Spark by @johnnynunez in #5915
- fix: correct stream response when enable_thinking is set to false by @minleminzui in #5881
- fix: update model runner by @zhyncs in #5934
- chore: bump v0.4.6.post2 by @zhyncs in #5939
- Support XiaomiMiMo/MiMo model inference by @ryang-max in #5921
- [P...
Release v0.4.6
Highlights
- Use FlashAttention3 as the default attention backend for main stream models (DeepSeek, Qwen, Llama, etc). #4709 (comment)
- PD disaggregation with mooncake and NIXL transfer backends #4880 #5477 #4655
- DeepSeek performance improvements: turn on DeepGemm by default and some kernel fusions. #5580 #5628
- Update torch to 2.6.0. Fix torch.compile cache. #5417 #5213
- Preliminary support for blackwell #5303
Thanks very much to LinkedIn team, Alibaba Cloud, Mooncake team, NVIDIA Team, AMD Team, Pytorch Team, Ant Group, Baseten Team, Oracle Team, Meituan Team, iFlytek MaaS team and the open source community users for their contributions!
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
Coming Soon
- Large scale expert parallelism + PD disaggregation #4734 #5524
- Pipeline Parallelism #5724
- MLA Cutlass Backend #5390
What's Changed
- [ci] fix llama4 ci error by @BBuf in #5126
- Refactor and Optimize FA3 Code by @hebiao064 in #5090
- Add Llama4 user guide by @ispobock in #5133
- [Misc] Use pytest.mark.skipif in sgl-kernel test by @yinfan98 in #5137
- feat: disable grammar restrictions within reasoning sections by @minleminzui in #4984
- [modelopt] automatically inspect if model is ModelOpt quantized and set quantization method by @yundai424 in #5145
- [AMD] Fix missing per_token_group_quant_fp8 for ROCm by @hubertlu-tw in #5140
- fix multimodal hash feature by @huangtingwei9988 in #5083
- Fix run time error in ROCm platform by @kkHuang-amd in #5147
- [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct by @zcnrex in #5103
- Add unit test on page_size > 1 and mla and integration test for Flash Attention 3 by @yubofredwang in #4760
- Use public model for FA3 speculative decode testing by @yubofredwang in #5152
- Add dummy grok test to amd CI. by @saienduri in #5115
- fix empty_cache error in pt_weights_iterator by @dangkai4u in #5151
- Fix torch compile errors by @kkHuang-amd in #5158
- Fix loading KV quantization scale; Enable modelopt kv cache by @yundai424 in #4686
- [PD] Fix unclosed prefill connection warning of mini_lb by @ShangmingCai in #5155
- Add optimized native kernels in sgl-kernel by @mingfeima in #5150
- [PD] Simplify mini LB by @ByronHsu in #4911
- Small improvement of native api docs by @simveit in #5139
- [feat&refactor] Enhance multimodal input support with refactor io_struct by @JustinTong0323 in #4938
- Support 2x8xH100 for Llama 4 by @fzyzcjy in #5159
- FP4 weight loading and inference (2/2) by @trevor-m in #3972
- Fix multimodal hashing error by @fzyzcjy in #5174
- Tiny disable model that does not work by @fzyzcjy in #5175
- [Bugfix] Fix index out of bounds in local attention with large sequences by @CatherineSue in #5173
- [Fix] DeepEP Compatibility with Low Latency by @liz-badada in #5068
- docs: remove the use of Downward API for LWS_WORKER_INDEX by @yankay in #5110
- feat: add DeepGEMM build warning by @zhyncs in #5176
- fix: use DeepEPDispatcher on CUDA by @zhyncs in #5180
- [DeepEP] fix: import buffer error by @ch-wan in #5179
- Let bench_one_batch support enable_dp_attention by @fzyzcjy in #4058
- [Misc] clean up vllm in sgl-kernel test by @yinfan98 in #5189
- Fix ci test "test_eval_fp8_accuracy" failed by @kkHuang-amd in #5185
- Optimize topk operation in llama4 by @fzyzcjy in #5128
- Support Llama4 fp8 inference by @HandH1998 in #5194
- [ci] fix ci test fused_moe op by @BBuf in #5102
- model: support mllama4 by @mickqian in #5144
- Rework grok test. by @saienduri in #5171
- sgl-kernel use cutlass latest version for fp8 blockwise gemm by @yizhang2077 in #5207
- Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 by @Muuuchen in #5196
- fix: log warning when disable cuda graph by @zhyncs in #5209
- [metrics] Add in queue metrics by @hebiao064 in #4444
- Fix DeepSeek error when using DeepEP mode by @fzyzcjy in #5190
- reduce moe_align_block_size_kernel small batch mode overhead by @BBuf in #5086
- [PD] Support KV transfer with mooncake by @stmatengss in #4880
- [PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool by @stmatengss in #5204
- Update deps for mllama4 by @ispobock in #5215
- Fix deepseek-v3 with torch.compile in PyTorch 2.6. by @zou3519 in #5213
- ROCm sgl-kernel: compatible to later torch by @HaiShaw in #5167
- [Misc] Clean sgl-kernel test by @yinfan98 in #5216
- Update Makefile / build script to avoid installing incompatible torch dependency by @elfiegg in #5245
- Fix torch.compile caching by @zou3519 in #5259
- ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations by @HaiShaw in #5228
- Optimize attention in llama4 by @fzyzcjy in #5127
- Optimize GPU memory usage in FlashAttentionBackend's strided indexing by @CatherineSue in #5262
- Support --enable-llama4-multimodal by @ch-wan in #5254
- [fix] fix mrope positions not picked up by @mickqian in #5265
- doc: nested loop code for offline engine by @minleminzui in #5244
- fix: examples for token_in_token_out_vlm by @JustinTong0323 in #5193
- Fix a 404 link in send_request.ipynb by @windsonsea in #5280
- fix: enable fp4 compilation on cu128 by @zhyncs in #5286
- feat: add cu128 identifier for sgl-kernel by @zhyncs in #5287
- chore: relax the torch version restriction for sgl-kernel compilation by @zhyncs in #5288
- chore: bump sgl-kernel v0.0.8.post1 by @zhyncs in #5289
- [PD] fix: skip warmup request in disaggregation mode to prevent crash on timeout by @GaoYusong in #5292
- [Docs] Supported Model Docs - Major restructuring by @adarshxs in #5290
- fix: update update_wheel_index for cu128 by @zhyncs in #5300
- [Docs] Remove the older supported docs section by @adarshxs in #5301
- remove moe_align_block_size torch.zeros in small batch/expert mode by @BBuf in #5298
- feat: add blackwell Dockerfile by @zhyncs in #5302
- feat: add blackwell workflow by @zhyncs in #5303
- fix: use fa3 unit test on hopper only by @zhyncs in #5304
- misc: update blackwell Dockerfile by @zhyncs in #5306
- fix: remove cublas_grouped_gemm by @zhyncs in #5307
- fix: update flash attn by @zhyncs in #5308
- fix: use deepgemm only on hopper by @zhyncs in #5310
- [VLM] Adopt fast image processor by default by @mickqian in #5065
- Adjust ci test threshold by @ispobock in #5271
- Blackwell Cutlass MLA kernel by @trevor-m in #5142
- misc: cleanup 3rdparty by @zhyncs in https:/...
Release v0.4.5
Highlights
The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.
New Features
- Llama 4 Support: We supported the Llama 4 model with accuracy matching official benchmark numbers, achieving a zero-shot score of 75.2 on the MMLU Pro dataset for the Llama-4-Scout-17B-16E-Instruct model and 80.7 for the Llama-4-Maverick-17B-128E-Instruct model. #5092
- FlashAttention 3 Backend: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks. #4709
- EAGLE3 Speculative Decoding: We’re proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput. Learn more in our documentation and the EAGLE3 paper. #4247
- DeepEP Integration: By incorporating DeepEP, we enhanced performance for MoE inference.
- Disaggregated Prefill and Decoding: We introduced a prototype for disaggregated prefill and decoding, with plans for further optimizations.
Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!
Coming Soon
- Disaggregated Prefill and Decoding: #4655
- Llama 4 Optimization: #5118
- EP Enhancement: #4734
- FA3 Enhancement: #4709
We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!
What's Changed
- Fix a regression introduced by overlapping KV cache writing by @merrymercy in #4375
- Update ci_install_dependency.sh to use accelerate 1.4.0 by @merrymercy in #4392
- Improve DP attention by @merrymercy in #4390
- Fix auto merge & add back get_flat_data_by_layer by @merrymercy in #4393
- Add some fused elementwise kernels for grok-1 by @merrymercy in #4398
- Fix Llama3.3 tool call support by @CatherineSue in #4320
- Fix the output of hidden states after HTTP requests by @Qiaolin-Yu in #4269
- Add a dummy grok test case by @merrymercy in #4399
- Hot fix for hicache with new page aligned radixtree by @xiezhq-hermann in #4397
- bump v0.4.4.post1 by @zhyncs in #4402
- Update CODEOWNERS by @merrymercy in #4403
- Hierarchical Caching supports MLA by @zeroorhero in #4009
- cleanup deps 1/n by @zhyncs in #4400
- feat(remote_model): support variable remote backend for model loader by @DellCurry in #3964
- [bug] fix duplicate variable MAX_PIXELS in qwen_vl.py by @qibaoyuan in #4419
- [Doc] fix wrong flag in deepseek documentation by @lausannel in #4427
- Add moe topk softmax templated from vllm by @qingquansong in #4302
- bump v0.0.5.post1 by @zhyncs in #4437
- Fix maximum recursion depth triggered on exception exit by @merrymercy in #4438
- use topk_softmax with sgl-kernel by @zhyncs in #4439
- docs: hot fix torch compile cache by @zhaochenyang20 in #4442
- ci: update transformers==4.48.3 by @mickqian in #4451
- Fix test_create_kvindices unit test by @sleepcoo in #4452
- [Fix] Fix errors when using the device except cuda. by @cboss6 in #4455
- docs: Add Llama 3.3 to supported models by @JiangJiaWei1103 in #4453
- Update bench_serving.py by @xu-song in #4454
- bugfix: Update sampling_params.py by @WrRan in #4413
- typos: Update sampling_params.md by @WrRan in #4391
- Auto-detect device if not specified in server arguments. by @vshekhawat-hlab in #4423
- Add support for upcoming QwenMoe by @michaelfeil in #4447
- perf: update fused moe config by @mickqian in #4459
- typos by @WrRan in #4368
- Fix minor style by @merrymercy in #4460
- cleanup deps 2/n by @zhyncs in #4464
- feat: Add FlashMLA submodule by @shuaills in #4449
- [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. by @Alcanderian in #4466
- Fix finish step for pr tests and notebook tests by @merrymercy in #4467
- Remove filter for pr-tests by @merrymercy in #4468
- Add greedy verification kernel by @Ying1123 in #4383
- Release sgl-kernel v0.0.5.post2 by @merrymercy in #4469
- Revert "feat: Add FlashMLA submodule (#4449)" by @zhyncs in #4470
- [Eagle] Remove the greedy branch and some redundant code by @Ying1123 in #4363
- Support FlashMLA backend by @sleepcoo in #4472
- fix custom allreduce performance/accuracy problem by @yizhang2077 in #4477
- 400 on empty input_ids by @yinghai in #4481
- Update CODEOWNERS by @merrymercy in #4484
- Statistical Analysis of the Output Stability of the Deepseek Model by @tanzelin430 in #4202
- model: support gemma-3-it by @mickqian in #4424
- Initialize image processor for skip-tokenizer-init codepath by @yinghai in #4479
- Fix: modelscope env comment by @huiwq1990 in #4474
- Fix: Complete int32 to int64 conversion by @xiezhq-hermann in #4465
- [ROCm] enable moe topk softmax in amd by @yiakwy-xpu-ml-framework-team in #4448
- Feat/support code completion by @woodx9 in #3612
- Add endpoint for file support, purely to speed up processing of input_embeds. by @RinRin-32 in #2797
- Set xgrammar as the default grammar backend by @minleminzui in #4386
- Fix router test by @ByronHsu in #4483
- [Fix] use torch.inference_mode() instead of torch.no_grad() by @Alcanderian in #4372
- [Feature] Support Deepseek-VL2 by @ccw1996 in #2798
- config: Update fused moe config by @mickqian in #4493
- Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. by @solrex in #4418
- Support Online Quantization for W8A8 by @hebiao064 in #4485
- Tool call with text by @xihuai18 in #4067
- Nicer standalone engine inferface by @yinghai in #4480
- [Fix] Resolve GPU Memory Leak in update_weights_from_tensor by @U-rara in #4446
- [Doc] add doc for quantization w8a8_fp8 or w8a8_int8 by @HandH1998 in #4495
- Fix data parallel + tensor parallel by @merrymercy in #4499
- [ROCm] fix dtype by @yiakwy-xpu-ml-framework-team in #4510
- Remove redundant type conversion by @merrymercy in #4513
- Update readme by @merrymercy in #4517
- [sgl-router] improvement to avoid hang by @yinghai in #4482
- Revert "feat: update grouped_topk to support softmax and sigmoid" by @ispobock in #4505
- bump v0.0.5.post3 by @zhyncs in #4520
- upgrade sgl-kernel 0.0.5.post3 by @zhyncs in https://github.com/sgl-project/sg...
Release v0.4.4
Highlights
The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. With the combination of FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, it can achieve nearly 100 tokens/s, which is currently the fastest open-source implementation. Look out for new optimizations coming soon!
Thanks very much to xAI Team, NVIDIA Team, AMD Team, LinkedIn team, Baseten Team, Oracle Team, Meituan Team and the open source community users for their contributions!
In addition to the users mentioned in the announcement, teams such as Tencent and Ant Group are also using SGLang for DeepSeek R1 inference acceleration. We are very happy to have received recognition and adoption from these teams!
Though surely there will be bugs and fixes that we'll be discovering and quickly patching in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel https://slack.sglang.ai/ Cheers!
Optimizations
- AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog
- Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with --enable-flashinfer-mla (see the launch sketch after this list)
- Advanced MTP Capabilities: Both Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script, compatible with radix cache and chunked prefill
- DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures - enable with export SGL_ENABLE_JIT_DEEPGEMM=1
- Pioneering INT8 Quantization: First industry implementation of INT8 support for DeepSeek R1 models
- Other Optimizations:
  - Blackwell architecture Block Scale FP8 GEMM support
  - Support page size greater than 1 #4356
  - Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89
  - Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 16) #4390
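A minimal launch sketch for the two opt-in features above, assuming a Hopper GPU: the SGL_ENABLE_JIT_DEEPGEMM environment variable and the --enable-flashinfer-mla flag are quoted from the notes above, while the model path and tensor-parallel size are illustrative placeholders.

```python
import os
import subprocess

# Opt in to the DeepGEMM JIT kernels (environment variable quoted above).
os.environ["SGL_ENABLE_JIT_DEEPGEMM"] = "1"

# Launch the server with the FlashInfer MLA backend enabled.
# The model path and TP size are placeholders; adjust for your deployment.
subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "deepseek-ai/DeepSeek-R1",  # placeholder checkpoint
        "--tp", "8",                                # illustrative TP size
        "--trust-remote-code",
        "--enable-flashinfer-mla",                  # flag quoted above
    ],
    check=True,
)
```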
Coming Soon
- Integrate Flash Attention #4385
- Integrate FlashMLA #4384
- EAGLE 2 optimization #4383
- EAGLE 3 day one support #4247
- Integrate DeepEP #4232
- Prefill and Decoding Disaggregation
What's Changed
- update flashinfer-python by @zhyncs in #3557
- fix doc by @zhyncs in #3558
- Add support for OpenAI API o1 model by @ChuyueSun in #3363
- fix sgl-kernel codestyle by @BBuf in #3563
- docs: update install by @zhyncs in #3581
- Copy config files for MI300X to support in virtualized environments by @yosoyjay in #3505
- ROCm docker: triton update by @HaiShaw in #3584
- [fix] added support for vlm in offline inference by @FrankLeeeee in #3548
- Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 by @ispobock in #3582
- [CI] Improve Docs CI Efficiency by @shuaills in #3587
- doc: emphasize and notify the usage of chat_template by @mickqian in #3589
- fix eagle unit test by @zhyncs in #3591
- fix high qps crash when enable mtp by @zhyncs in #3592
- fix apply_token_bitmask_inplace_cuda by @zhyncs in #3594
- [docs] added favicon to sphinx html by @FrankLeeeee in #3564
- fix lockfile and port_registry file permission error by @Jiadalee in #3598
- feat: Support Qwen 2.5 vl by @mickqian in #3258
- [ROCm] Use tl.range() in block GEMM kernels with num_stages set by host. by @whchung in #3535
- Update to latest amd image. by @saienduri in #3597
- Benchmark for reasoning models by @simveit in #3532
- Draft of updated doc for sampling params. by @simveit in #3260
- [docs] Update sampling_params.md by @shuaills in #3617
- [docker] added rdma support by @FrankLeeeee in #3619
- Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stage… by @zhyncs in #3632
- add mtp unit test by @zhyncs in #3634
- update unit test by @zhyncs in #3636
- chore: bump v0.4.3.post1 by @zhyncs in #3638
- h800 deepseek r1 config and support multi-gpu block-gemm tuning by @BBuf in #3639
- feat: support flashinfer mla with prefix cache by @zhyncs in #3643
- chore: update flashinfer v0.2.1.post2 by @zhyncs in #3644
- chore: bump v0.4.3.post2 by @zhyncs in #3645
- use transformers 4.48.3 by @zhyncs in #3650
- [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. by @whchung in #3616
- [ROCm] Optimal MOE Tuning for AMD Radeon Graphics by @BruceXcluding in #3567
- Deploy multi-node inference (LWS method) using sglang in a K8s cluster by @whybeyoung in #3624
- Update amd docker image. by @saienduri in #3654
- [Feature] Apply Cublas Grouped Gemm kernel by @Fridge003 in #3629
- update pr-test by @zhyncs in #3663
- Fix draft decode max batch size by @ispobock in #3676
- fix: remove dependency on latest transformers impl by @mickqian in #3635
- AMD Prefill optimize by @fsx950223 in #3665
- fix: apply cache size limit of attention mask for VisionAttention by @mickqian in #3657
- set NCCL_IB_GID_INDEX=3 for multi node NVIDIA InfiniBand if needed by @zhyncs in #3698
- use warp shuffle style reduce and flashinfer vectorize by @BBuf in #3628
- [Docs] Add SkyPilot DeepSeek example by @Michaelvll in #3706
- [k8s] remove unnecessary hostIPC for security concern by @panpan0000 in #3700
- [moe] optim: reduce memory consumption in fused_moe by @ch-wan in #3692
- [Improve] Fix Multi-User Port Allocation Conflicts by @shuaills in #3601
- Variance measure for reasoning benchmark by @simveit in #3677
- Docs: Fix layout with sub-section by @zhaochenyang20 in #3710
- add control for cutlass fp8 blockwise gemm by @yizhang2077 in #3727
- revert BLOCK and num_warps on HIP by @HaiShaw in #3722
- Optimize triton attention custom mask by @ispobock in #3731
- [Bugfix] Fix scores mask for moe topk by @Chen-XiaoBing in #3705
- [Docs] Modify ep related server args and remove cublas part of deepseek by @Fridge003 in #3732
- [Fix] Fix bugs and refactor codes in lora for better scalability. by @aoshen524 in #3652
- docs: fix 404 link by @trayvonpan in #3588
- [docs] added torch.compile cache to dpsk manual by @FrankLeeeee in #3737
- AMD/ROCm: update AITER repo to ROCm/aiter by @HaiShaw in #3747
- feat: update grouped_topk to support softmax and sigmoid by @zixuanzhang226 in #3680
- feat: Add SageMaker support by @andjsmi in #3740
- Change description of nvidia jetson docs by @shahizat in https://github.com/sgl-proj...
Release v0.4.3
Highlights
The SGLang team is excited to announce the release of v0.4.3. We will keep improving DeepSeek V3/R1 performance. In the last six weeks, SGLang has been the fastest engine running DeepSeek V3/R1 among all open-source LLM inference engines. We stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel https://slack.sglang.ai Cheers!
Performance Improvements
DeepSeek V3/R1 Optimizations
- Pioneering integration of FlashInfer MLA Attention delivers 4x performance improvement for long-context scenarios (Special thanks to the FlashInfer team @yzh119 ) #3550
- Added torch.compile support for FP8, achieving 50 tokens/s for online inference #3232
- Implemented CUTLASS block-wise FP8 for enhanced efficiency
Architecture Enhancements
- Upgraded to FlashInfer v0.2
- Enabled Flash Attention 3 by default for prefill
- Extended EAGLE 2 support:
- Enhanced integration with FlashInfer backend
- Added support in Triton backend
New Features
- Introduced Function Calling capabilities
- Added regex pattern support in XGrammar backend
- Implemented custom sampling processor for flexible inference control
- Integrated LoRA support in Triton backend
What's Changed
- docs: add deepseek v3 launch instructions by @zhyncs in #2589
- fix: only enable moe_align_block_size for now by @zhyncs in #2590
- docs: update deepseek v3 example by @zhyncs in #2592
- h100 tuning fused_moe_triton for qwen2 moe by @BBuf in #2560
- Fix cache hit rate when chunked prefill by @hnyls2002 in #2555
- Update README.md by @merrymercy in #2594
- Error occurs when loading the gemma model in bitsandbytes format. by @upskyy in #2557
- [Feature] Support new parameter - EBNF in xgrammar by @adarshxs in #2526
- update readme of DeepSeek V3 by @fsygd in #2596
- Fix logprob_start_len for multi modal models by @merrymercy in #2597
- Fix duplicated handling of GetWeightsByNameReqInput by @fzyzcjy in #2565
- [unittest] add unit test to test quant args of srt engine by @JamesSand in #2574
- Fix test and benchmark scripts by @merrymercy in #2598
- fix: package data missing by @yudian0504 in #2521
- [UTILS] improve makefile a bit by adding help info by @kzhou003 in #2570
- Super tiny typo fix by @fzyzcjy in #2564
- Update contributor_guide.md by @merrymercy in #2603
- Update README.md by @merrymercy in #2605
- Tiny code cleanup in tokenizer_manager.py by @fzyzcjy in #2586
- Regression fix to AMD/ROCm from recent change by @HaiShaw in #2606
- Update CODEOWNERS by @merrymercy in #2608
- Fused moe triton cfg opt for rocm by @kkHuang-amd in #2612
- Fix triton kernel performance regression by @kkHuang-amd in #2611
- Change extend attention kernel launch parameter for ROCm platform to … by @kkHuang-amd in #2610
- fix moe_align_block_size by @HandH1998 in #2615
- update sgl_moe_align_block_size usage by @HandH1998 in #2617
- chore: bump v0.4.1.post1 by @zhyncs in #2616
- docs: update README by @zhyncs in #2618
- [FIX] Update EOS from config by @zhengy001 in #2475
- [minor] clean up docs and eos id by @merrymercy in #2622
- Add more supporting organizations by @merrymercy in #2623
- Update readme by @ispobock in #2625
- avoid fused_moe_triton padding circular import by @BBuf in #2624
- [CI] Fix nightly test and raise better error message by @merrymercy in #2626
- Docs: Add constrained decoding tutorial by @shuaills in #2614
- [docs]Refactor constrained decoding tutorial by @shuaills in #2633
- add configs for block fp8 related kernels by @zhyncs in #2628
- Add update_weights_from_tensor by @fzyzcjy in #2631
- [Feature] Function Calling by @Tushar-ml in #2544
- [Docs] Add EBNF to sampling params docs by @adarshxs in #2609
- Clean up wrapper in flashinfer backend by @merrymercy in #2638
- minor: add nsys cli for docker dev by @zhyncs in #2639
- Add llama_eagle.py by @merrymercy in #2640
- [Session] Update session control interface by @Ying1123 in #2635
- AMD: set weights and scaling numbers properly for block FP8 by @HaiShaw in #2637
- Update Triton configs for block fp8 kernels by @HandH1998 in #2641
- chore: bump v0.4.1.post2 by @zhyncs in #2643
- docs: update README by @zhyncs in #2644
- docs: add development guide using docker by @zhyncs in #2645
- [Feature] Get Token IDs with Engine.generate() by @shuaills in #2636
- Fix unittest for input tokens by @shuaills in #2646
- skip special token for unit test by @zhaochenyang20 in #2648
- Release 0.4.1.post3 - upload the config.json to PyPI by @merrymercy in #2647
- Update the timeout in nightly-test.yml by @merrymercy in #2649
- add 2*h20 node serving example for deepseek v3 by @Lzhang-hub in #2650
- docs: update README by @zhyncs in #2651
- [feat] Add math eval to CI by @XiaotongJiang in #2652
- Revert "[feat] Add math eval to CI" by @merrymercy in #2656
- fix typo by @HaiShaw in #2655
- [Docs] clean up structured outputs docs by @merrymercy in #2654
- Update structured_outputs.ipynb by @merrymercy in #2666
- Refactor sgl-kernel build by @ispobock in #2642
- Refactor logprob computation to return the real logprob used in sampling by @merrymercy in #2664
- Add GemLite caching after each capture by @mobicham in #2669
- AMD DeepSeek_V3 FP8 Numerical fix by @HaiShaw in #2667
- Minor follow-up fixes for the logprob refactor by @merrymercy in #2670
- Tiny update scripts to fail fast by @fzyzcjy in #2672
- Improve the computation for time_per_output_token Prometheus metrics by @merrymercy in #2674
- Add cutlass submodule for sgl-kernel by @ispobock in #2676
- minor: cleanup sgl-kernel by @zhyncs in #2679
- Eagle speculative decoding part 1: Support target model verification in the attention backend by @merrymercy in #2678
- misc: update CODEOWNERS by @zhyncs in #2680
- feat: use CUDA 12.4 by default (for FA3) by @zhyncs in #2682
- Update README.md by @merrymercy in #2683
- Eagle speculative decoding part 2: Fix cuda graph + DP attention hanging by @merrymercy in #2684
- [Fix] fix openai adapter by @Ying1123 in #2685
- h200 tuning fused_moe_triton config for Mixtral 8x7B/8x22B and Qwen2 57BA14B by @BBuf in #2689
- [Docs] refactor Contribution Guide by @shuaills in #2690
- Doc: Rename contribution_guide.md by @zhaochenyang20 in #2691
- ROCm base image update by @kkHuang-amd in #2692
- [Docs] Add Support for Structured Output Format by @shuaills in #2697
- [feat]...
Release v0.4.1
Highlights
- We're excited to announce SGLang v0.4.1, which now supports DeepSeek V3 - currently the strongest open-source LLM, even surpassing GPT-4o. The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one. We've also supported MLA optimization and DP attention before, making SGLang one of the best open-source LLM engines for running DeepSeek models. Special thanks to Meituan's Search & Recommend Platform Team @ispobock @HandH1998 and Baseten's Model Performance Team @zhyncs for implementing the model, and DataCrunch for providing GPU resources.
- Various improvements to the cache-aware sglang router, torchao integration, and server termination.
- Added a standalone package sgl-kernel for supporting more custom kernels in the code base.
What's Changed
- Adding SGLang FP8 Utils by @HaiShaw in #2348
- docs: add SGLang v0.4 blog by @zhyncs in #2341
- MLA prefill w/o weight absorption by @ispobock in #2349
- Check gpu availability at server args creation by @MrAta in #2340
- minor: limit the range of vllm versions by @zhyncs in #2350
- Fix Docs CI When Compile Error by @zhaochenyang20 in #2323
- Add Docs For SGLang Native Router by @zhaochenyang20 in #2308
- Make torch TP composable with torch.compile by @kwen2501 in #2352
- move apply_torchao_config_ to model_runner by @jerryzh168 in #2342
- [Minor] Code style improvements by @merrymercy in #2355
- Fix AWQ with enable MLA by @ispobock in #2364
- MoE Expert Parallel by @xiaobochen123 in #2371
- Move FP8 to SGLang by @zhyncs in #2370
- optimize cuda graph max_bs_settings on low-end gpus by @BBuf in #2360
- Add more support for intel Gaudi accelerators by @YangQun1 in #2357
- [router] support /add_worker api by @ByronHsu in #2369
- docs: update adoption (Meituan) by @zhyncs in #2373
- Use proc.join instead of busy waiting by @merrymercy in #2374
- docs: Improve instructions for supporting new models by @vchzls in #2363
- Fix the overlap for xgrammar by @merrymercy in #2377
- Release v0.4.0.post1 by @merrymercy in #2375
- [Router] remove duplicate char count by @ByronHsu in #2378
- [router] add remove tenant method in the radix tree by @ByronHsu in #2379
- [router] Add remove worker api by @ByronHsu in #2380
- fix: resolve fp8 moe issue by @zhyncs in #2387
- fix: update xgrammar v0.1.6 by @zhyncs in #2390
- Fp8 MoE optimizations on AMD by @HaiShaw in #2388
- minor: update killall script by @zhyncs in #2391
- [router] Health check on worker before added to the router by @ByronHsu in #2392
- Fix shape error that occurred when loading lora weight of gemma2 model. by @upskyy in #2330
- nit: Remove busy waiting on scheduler by @rkooo567 in #2382
- Optimize Triton decoding kernel for long context by @ispobock in #2394
- Update killall_sglang.sh by @merrymercy in #2397
- Remove unused vars in the triton backend by @ispobock in #2401
- Fix a bug with logprob streaming + chunked prefill by @merrymercy in #2403
- fix: specify dtype with begin_forward aka plan by @zhyncs in #2404
- Fix recv_requests by @merrymercy in #2405
- minor: update correct measurement unit by @zhyncs in #2406
- feat: support custom task runner by @zhyncs in #2407
- minor: add random use case by @zhyncs in #2408
- minor: add random flashinfer vs triton use case by @zhyncs in #2409
- Simplify stream_output by @merrymercy in #2398
- [router] Improve cleanup logic by @ByronHsu in #2411
- [Router] fix interrupt from terminal by @ByronHsu in #2413
- [router] defer health checking to router init by @ByronHsu in #2393
- reduce watchdog interval to 5s by @ByronHsu in #2410
- Add a unittest for fused_moe by @BBuf in #2416
- [Minor] Improve code style by @merrymercy in #2419
- [Minor] Improve code style by @merrymercy in #2422
- [feat] Enable chunked prefill for llava-onevision by @Ying1123 in #2412
- Typo fix in router.md by @adarshxs in #2424
- feat: support sgl-kernel PyPI by @zhyncs in #2433
- fix: use manylinux2014_x86_64 tag by @zhyncs in #2434
- fix: compatible with PEP 440 by @zhyncs in #2435
- [router] Refactor: decouple select and send stage by @ByronHsu in #2440
- [router] Use borrow if possible to save cost by @ByronHsu in #2441
- Make torch TP composable with torchao by @kwen2501 in #2436
- chore: update ao v0.7.0 by @zhyncs in #2447
- decoding attention kernel benchmark by @bjmsong in #2425
- Fix model loader for more quantization formats by @merrymercy in #2448
- Fix warmup in bench_offline_throughput.py by @merrymercy in #2449
- Add support for IBM Granite 3.x models by @frreiss in #2437
- [router] Add retries based fault tolerance by @ByronHsu in #2452
- [router] remove main.rs because only lib.rs is used for py binding by @ByronHsu in #2453
- [Core] in batch prefix caching by delay scheduling by @rkooo567 in #2442
- [router] Update doc for dynamic scaling and fault tolerance by @ByronHsu in #2454
- [router] Release router 0.1.0 with dynamic scaling and fault tolerance by @ByronHsu in #2455
- Make request payload size configurable by @MrAta in #2444
- Include version info into the router package by @MrAta in #2456
- Bump sglang-router to 0.1.1 by @MrAta in #2459
- chore: bump v0.0.2 for sgl-kernel by @zhyncs in #2462
- minor: update pypi tag by @zhyncs in #2463
- fix: set runtime path by @zhyncs in #2466
- Rename rust folder to sgl-router by @MrAta in #2464
- feat: support dev image by @zhyncs in #2469
- [Minor] Fix grok model loader by @merrymercy in #2473
- Fix correctness issue for triton decoding kernel by @ispobock in #2479
- format: add clang-format for sgl-kernel by @zhyncs in #2483
- Remove cuda graph batch size adjustment for dp attention by @ispobock in #2484
- hotfix: checking for HIP by @zhyncs in #2485
- sgl-kernel adapt tensorrt llm custom allreduce by @yizhang2077 in #2481
- fix typo by @zhyncs in #2487
- [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm by @BBuf in #2486
- fix moe-ep accuracy issue for fp8 by @xiaobochen123 in #2489
- minor: update flashinfer nightly by @zhyncs in #2490
- Small fixes for torchao quant by @jerryzh168 in #2476
- Simplify pytorch sampling kernel and logit processor by @merrymercy in #2491
- Temporarily disable unit test of torch native attenti...
Release v0.4.0
Highlights
blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/
We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:
- Zero-overhead batch scheduler: 1.1x increase in throughput.
- Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
- Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
- Fast structured outputs with xgrammar: up to 10x faster.
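The structured-output speedup comes from grammar-constrained decoding with xgrammar: the server compiles a JSON schema (or grammar) and masks disallowed tokens at each decoding step. As a rough illustration only, not taken from these release notes, the sketch below shows one way to request schema-constrained output from a locally running SGLang server through its OpenAI-compatible endpoint; the host/port, model name, and exact `response_format` fields are assumptions and may differ between versions.

```python
# Hypothetical sketch: JSON-constrained generation against a local SGLang server.
# Assumes a server was launched separately with the OpenAI-compatible API on
# http://localhost:30000; adjust the model name and schema to your deployment.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# A small JSON schema that the grammar-constrained decoder should enforce.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Give me info about Paris as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)

print(json.loads(resp.choices[0].message.content))
```

The client-side code is ordinary; the constraint enforcement happens server-side during decoding, which is where the reported speedup over unconstrained or regex-based approaches comes from.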
What's Changed
- fix: add xgrammar dependency by @zhyncs in #2126
- docs: fix module docstrings and copyright headers by @XuehaiPan in #2077
- feat(pre-commit): trim unnecessary notebook metadata from git history by @XuehaiPan in #2127
- Expose max total num tokens from Runtime & Engine API by @henryhmko in #2092
- Only stream output on tp rank 0 by @merrymercy in #2124
- Revert "Only stream output on tp rank 0" by @merrymercy in #2130
- Add initial support for intel Gaudi accelerators by @ankurneog in #2121
- Add simple CPU offloading support. by @janimo in #2081
- Fix grid size in Triton decoding kernel by @ispobock in #2134
- [CI] Fix test cases by @merrymercy in #2137
- Add concurrency option for benchmark by @cermeng in #2136
- Fix dp print message by @merrymercy in #2138
- fix: resolve bench_serving args by @zhyncs in #2139
- [router] cache-aware load-balancing router v1 by @ByronHsu in #2114
- Bump sglang-router to 0.0.5 by @ByronHsu in #2142
- update router doc by @ByronHsu in #2143
- fix dp_rank env by @ByronHsu in #2144
- Add more api routes (completion, health, etc) to the router by @ByronHsu in #2146
- add prefix match for certain tenant by @ByronHsu in #2147
- Improve sglang router by @ByronHsu in #2148
- Merged three native APIs into one: get_server_info by @henryhmko in #2152
- feat: remove the dependency on FusedMoE by @zhyncs in #2153
- feat: update gitignore and add tuning config for FusedMoE by @zhyncs in #2155
- fix: resolve end-of-file-fixer by @zhyncs in #2157
- Simplify `Scheduler.update_running_batch` by @merrymercy in #2154
- feat: update other MoE models deps by @zhyncs in #2156
- Update CI threshold & Improve code style by @merrymercy in #2159
- fix: use torch.sum for compatible by @zhyncs in #2161
- Fix mixed chunked prefill in overlap mode by @merrymercy in #2158
- Balance CI tests by @merrymercy in #2162
- Rename triton_fused_moe -> fused_moe_triton by @merrymercy in #2163
- Fix docs by @merrymercy in #2164
- [Fused moe] add tuning fused configs for qwen2 57b and mixtral 8x7b by @BBuf in #2167
- Allow overwrite flashinfer use_tensorcore by @merrymercy in #2169
- Replace prob based with threshold based load balancing by @ByronHsu in #2170
- feat: fused_moe fp8 monkey patch by @zhyncs in #2174
- [Fix] Avoid calling fill_vocab_mask for terminated requests by @Ubospica in #2175
- [CI] Split test cases in CI for better load balancing by @merrymercy in #2180
- Bump rustls from 0.23.16 to 0.23.18 in /rust by @dependabot in #2182
- [feat] Refactor session control interface and add CI by @Ying1123 in #2173
- [router] Replace print with logger by @ByronHsu in #2183
- Use custom allreduce w/ torch.compile by @merrymercy in #2185
- [Performance]: Process affinity to CPU cores with multiple sockets support by @HaiShaw in #2171
- Update CI threshold by @merrymercy in #2186
- Update XGrammar to the latest API by @Ubospica in #2176
- [router] Rust e2e test by @ByronHsu in #2184
- Input_embeds support by @RinRin-32 in #2052
- [CI] Minor fix for CI by @merrymercy in #2187
- Rename double sparsity config file by @merrymercy in #2188
- Release v0.3.6.post1 by @merrymercy in #2189
- Update sampler.py to skip the success check by @merrymercy in #2197
- remove unused imports by @WrRan in #2195
- Remove unresolved reference 'self' by @apemost in #2198
- using `is not`, not `!=`, to test `None` by @WrRan in #2196
- fix: add cuda-python for xgrammar by @zhyncs in #2199
- minor: update check_env by @zhyncs in #2201
- add sglang version to get_server_info by @binarycrayon in #2206
- docs: update adoption by @zhyncs in #2204
- Bump router to 0.0.9 with better logging by @ByronHsu in #2207
- Fix rust warning by @ByronHsu in #2208
- Fix flaky tests by @merrymercy in #2212
- [feat] Support session control for vision language models by @Ying1123 in #2210
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2217
- Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" by @merrymercy in #2221
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2222
- Release v0.3.6.post2 by @merrymercy in #2214
- Rename DP_RANK to SGLANG_DP_RANK by @merrymercy in #2218
- [3rdparty, document] Updated documentation for triton fused_moe kernel tuning on AMD Instinct GPUs by @kkHuang-amd in #2191
- Bump sglang-router to 0.0.10 for env name change by @ByronHsu in #2226
- fix typo prompts by @qibaoyuan in #2224
- Remove fused_moe_grok by @merrymercy in #2223
- add profile in offline benchmark & update doc by @bjmsong in #2123
- Rename tuned MI300X config files for fused_moe_triton by @HaiShaw in #2228
- Update Install Method 2. From source by @HaiShaw in #2232
- Fix chunked prefill size for bench_offline_throughput by @merrymercy in #2234
- Disable overlap scheduler for multimodal models by @merrymercy in #2235
- Add OLMo2 model. by @janimo in #2233
- Crash the server correctly during error by @merrymercy in #2231
- Fix memory leak during abort by @merrymercy in #2238
- fix missing launch server import by @qeternity in #2242
- [fix] Fix prefix caching for multi-image/video by @Ying1123 in #2239
- Update backend.md by @merrymercy in #2250
- Update backend.md by @merrymercy in #2251
- Revert "Add simple CPU offloading support" by @Ying1123 in #2252
- Revert "Revert "Add simple CPU offloading support"" by @Ying1123 in #2253
- Simplify tokenizer manager by @merrymercy in #2254
- Fix hash collision for multi modal models by @merrymercy in #2256
- [Minor] fix the style for multimodal models by @merrymercy in #2257
- chore: bump v0.3.6.post3 by @zhyncs in https://github.com/sgl-project/sglang/pul...