
Releases: sgl-project/sglang

Release v0.5.1

23 Aug 19:57
97a38ee


Release v0.4.10

31 Jul 18:48
0232886

Highlights

This is a regular release with many new optimizations, features, and fixes. Please check out the following exciting roadmaps and blogs.


Release v0.4.8

24 Jun 18:43
7c3a12c

Highlights

OpenAI-Compatible Server Refactor

Restructured the OpenAI-compatible server to support production and enterprise environments. Key improvements include:

  • Consistent metrics and logging for better observability and debugging.

  • Unified error handling, request validation, and processing logic for improved reliability and maintainability.

  • Improved request tracking across sessions and components.

  • Fixed bugs in embedding requests and reasoning parsers.

This work was a collaborative effort involving engineers from academic and industry institutions. Special thanks to the Oracle Cloud team and the SGLang team and community — including @slin1237, @CatherineSue, @key4ng, @JustinTong0323, @jhinpan, @yhyang201, @woodx9 and @whybeyoung — for their invaluable contributions.

DeepSeek R1 FP4 on Blackwell GPU

Added support for DeepSeek R1 with FP4 and MTP on NVIDIA Blackwell GPUs.

  • Integrated FlashInfer NVFP4 MoE, supporting TP, EP, and DP.

  • Supported 2-stream shared expert execution.

  • Achieved up to 90 TPS per user at ISL/OSL/BS (input length / output length / batch size) = 1k/1k/16 on B200.
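
As a rough illustration only, a deployment of this kind might be launched as below. This is a hypothetical sketch: the FP4 checkpoint name and the quantization value are assumptions, not taken from these notes.

    # Hypothetical launch sketch for DeepSeek R1 FP4 on Blackwell.
    # The checkpoint name and the "modelopt_fp4" value are assumptions.
    import subprocess

    subprocess.run([
        "python", "-m", "sglang.launch_server",
        "--model-path", "nvidia/DeepSeek-R1-FP4",  # assumed FP4 checkpoint
        "--quantization", "modelopt_fp4",          # assumed quantization name
        "--tp", "8",                               # assumed tensor-parallel degree
    ])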

Further optimization is in progress. Special thanks to the FlashInfer, NVIDIA Enterprise Products, Novita AI, DataCrunch, Google Cloud, and SGLang teams — especially @Alcanderian and @pyc96 — for their critical contributions.

Breaking Change: OpenAI-Compatible API Module Moved

The sglang/srt/openai_api directory has been removed and replaced with sglang/srt/entrypoints/openai.

Update your imports to the new module path. For example:

- from sglang.srt.openai_api.protocol import Tool
+ from sglang.srt.entrypoints.openai.protocol import Tool
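
For code that must run on both sides of this change, a small compatibility shim is one option; a minimal sketch using only the two module paths shown above:

    # Compatibility shim: prefer the new module path, fall back to the
    # old one on installations older than v0.4.8.
    try:
        from sglang.srt.entrypoints.openai.protocol import Tool
    except ImportError:
        from sglang.srt.openai_api.protocol import Tool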


Release v0.4.7

11 Jun 19:14
4f723ed

Highlights

  • The PD disaggregation and large-scale EP functionality from the blog post has now been fully merged into the latest release (a launch sketch follows this list).

  • The blog's results have been successfully reproduced by more than six industry teams, including the TensorRT-LLM team.

  • SGLang’s large-scale EP is now actively used by leading organizations such as Cursor, Qwen, Alimama, Alibaba Cloud, iFlytek, and more. It has been deployed and validated at large scale, running on GPU clusters with thousands of devices.

  • PD disaggregation and large-scale EP, in addition to supporting DeepSeek V3/R1, now also support Qwen 3 in the latest release.

  • Full Blackwell support for DeepSeek V3/R1, Llama 4, and Qwen 3. Further optimizations are underway.

  • SGLang's DeepSeek V3/R1 now achieves 190 TPS on a single H200, outperforming other frameworks by over 50%.
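
As a rough sketch of what a disaggregated deployment looks like, one prefill instance and one decode instance are launched separately and then paired. The disaggregation flag names and the model below are assumptions, not taken from these notes.

    # Hypothetical PD-disaggregation sketch; flag names and model are assumptions.
    import subprocess

    common = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "Qwen/Qwen3-32B",  # assumed model
    ]
    # Prefill-only and decode-only instances on separate ports.
    subprocess.Popen(common + ["--disaggregation-mode", "prefill", "--port", "30000"])
    subprocess.Popen(common + ["--disaggregation-mode", "decode", "--port", "30001"])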

We extend our sincere thanks to the following contributors, listed in alphabetical order: Alibaba Cloud, AMD Team, Ant Group, Baseten Team, Cursor Team, Dynamo Team, EAGLE Team, FlashInfer Team, Google Vertex AI Team, iFlytek MaaS Team, Intel Team, LinkedIn Team, Meituan Team, Microsoft Copilot Team, Mooncake Team, NVIDIA Team, Oracle Team, Qwen Team, Voltage Park Team and open source community users. Your support and collaboration are deeply appreciated!


Release v0.4.6

27 Apr 21:47
84022c0

Highlights

  • Use FlashAttention3 as the default attention backend for mainstream models (DeepSeek, Qwen, Llama, etc.); an override sketch follows this list. #4709
  • PD disaggregation with Mooncake and NIXL transfer backends #4880 #5477 #4655
  • DeepSeek performance improvements: DeepGEMM enabled by default, plus several kernel fusions. #5580 #5628
  • Updated torch to 2.6.0; fixed the torch.compile cache. #5417 #5213
  • Preliminary support for Blackwell #5303
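
An override sketch for the first item above, for deployments that want to pin a backend rather than rely on the new default. The "fa3" value naming is an assumption.

    # Hypothetical sketch: pin the attention backend explicitly instead of
    # relying on the FlashAttention3 default. The "fa3" value is an assumption.
    import subprocess

    subprocess.run([
        "python", "-m", "sglang.launch_server",
        "--model-path", "Qwen/Qwen2.5-7B-Instruct",  # assumed model
        "--attention-backend", "fa3",
    ])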

Thanks very much to the LinkedIn team, Alibaba Cloud, Mooncake team, NVIDIA team, AMD team, PyTorch team, Ant Group, Baseten team, Oracle team, Meituan team, iFlytek MaaS team, and the open-source community users for their contributions!

We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!

Coming Soon

  • Large scale expert parallelism + PD disaggregation #4734 #5524
  • Pipeline Parallelism #5724
  • MLA Cutlass Backend #5390


Release v0.4.5

07 Apr 08:33
57f9960

Highlights

The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.

New Features

  • Llama 4 Support: We added support for the Llama 4 models, with accuracy matching official benchmark numbers, achieving a zero-shot MMLU Pro score of 75.2 for Llama-4-Scout-17B-16E-Instruct and 80.7 for Llama-4-Maverick-17B-128E-Instruct. #5092

  • FlashAttention 3 Backend: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks. #4709

  • EAGLE3 Speculative Decoding: We're proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput (a launch sketch follows this list). Learn more in our documentation and the EAGLE3 paper. #4247

  • DeepEP Integration: By incorporating DeepEP, we enhanced performance for MoE inference.

  • Disaggregated Prefill and Decoding: We introduced a prototype for disaggregated prefill and decoding, with plans for further optimizations.
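
A launch sketch for the EAGLE3 item above. The speculative-decoding flag names and the draft model are assumptions; see the documentation for the exact interface.

    # Hypothetical EAGLE3 launch sketch; flag names and models are assumptions.
    import subprocess

    subprocess.run([
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # assumed target model
        "--speculative-algorithm", "EAGLE3",                 # assumed flag/value
        "--speculative-draft-model-path",
        "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",               # assumed draft model
        "--speculative-num-steps", "5",                      # assumed tuning knobs
        "--speculative-eagle-topk", "8",
        "--speculative-num-draft-tokens", "32",
    ])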

Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!

Coming Soon

  • Disaggregated Prefill and Decoding: #4655

  • Llama 4 Optimization: #5118

  • EP Enhancement: #4734

  • FA3 Enhancement: #4709

We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!


Release v0.4.4

13 Mar 18:21
6aaeb84

Highlights

The SGLang team is excited to announce the release of v0.4.4. We will keep improving DeepSeek V3/R1 performance. With the combination of FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, SGLang can achieve nearly 100 tokens/s, which is currently the fastest open-source implementation. Look out for new optimizations coming soon!

Thanks very much to xAI Team, NVIDIA Team, AMD Team, LinkedIn team, Baseten Team, Oracle Team, Meituan Team and the open source community users for their contributions!

Beyond the users mentioned in the announcement, teams such as Tencent and Ant Group are also using SGLang to accelerate DeepSeek R1 inference. We are very happy to have earned these teams' recognition and adoption!

Surely there will be bugs that we'll discover and patch quickly in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel at https://slack.sglang.ai/. Cheers!

Optimizations

  • AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog

  • Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with
    --enable-flashinfer-mla

  • Advanced MTP Capabilities: Both Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script, compatible with radix cache and chunked prefill.

  • DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures (a combined launch sketch follows this list) - enable with
    export SGL_ENABLE_JIT_DEEPGEMM=1

  • Pioneering INT8 Quantization: First industry implementation of INT8 support for DeepSeek R1 models.

  • Other Optimizations:

    • Blackwell architecture Block Scale FP8 GEMM support

    • Support for page sizes greater than 1 #4356

    • Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89

    • Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 16) #4390
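
Putting the two opt-in switches above together, a launch might look like the following minimal sketch. The flag and environment variable are quoted from the notes above; the model path is an assumption.

    # Minimal sketch combining the opt-in switches from the notes above:
    # FlashInfer MLA via --enable-flashinfer-mla, and DeepGEMM via
    # SGL_ENABLE_JIT_DEEPGEMM=1. The model path is an assumption.
    import os
    import subprocess

    env = dict(os.environ, SGL_ENABLE_JIT_DEEPGEMM="1")
    subprocess.run([
        "python", "-m", "sglang.launch_server",
        "--model-path", "deepseek-ai/DeepSeek-R1",  # assumed model
        "--enable-flashinfer-mla",
    ], env=env)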

Coming Soon

  • Integrate Flash Attention #4385

  • Integrate FlashMLA #4384

  • EAGLE 2 optimization #4383

  • EAGLE 3 day one support #4247

  • Integrate DeepEP #4232

  • Prefill and Decoding Disaggregation


Release v0.4.3

14 Feb 02:50
e0b9a42

Highlights

The SGLang team is excited to announce the release of v0.4.3. We will keep improving DeepSeek V3/R1 performance. For the last six weeks, SGLang has been the fastest engine running DeepSeek V3/R1 among all open-source LLM inference engines, and we stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel at https://slack.sglang.ai. Cheers!

Performance Improvements

DeepSeek V3/R1 Optimizations

  • Pioneering integration of FlashInfer MLA attention delivers a 4x performance improvement for long-context scenarios (special thanks to the FlashInfer team, @yzh119) #3550
  • Added torch.compile support for FP8, achieving 50 tokens/s for online inference #3232
  • Implemented CUTLASS block-wise FP8 for enhanced efficiency

Architecture Enhancements

  • Upgraded to FlashInfer v0.2
  • Enabled Flash Attention 3 by default for prefill
  • Extended EAGLE 2 support:
    • Enhanced integration with FlashInfer backend
    • Added support in Triton backend

New Features

  • Introduced Function Calling capabilities
  • Added regex pattern support in the XGrammar backend (a request sketch follows this list)
  • Implemented custom sampling processor for flexible inference control
  • Integrated LoRA support in Triton backend
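
A request sketch for the regex item above. The request shape follows SGLang's native /generate endpoint and assumes a server already running on localhost:30000.

    # Minimal sketch: regex-constrained generation against a running server.
    # The endpoint shape and "regex" sampling parameter are assumptions.
    import requests

    resp = requests.post(
        "http://localhost:30000/generate",
        json={
            "text": "The capital of France is",
            "sampling_params": {
                "max_new_tokens": 8,
                "regex": r" (Paris|London|Berlin)",  # constrain the completion
            },
        },
    )
    print(resp.json()["text"])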


Release v0.4.1

25 Dec 23:27
efc52f8

Highlights

  • We're excited to announce SGLang v0.4.1, which now supports DeepSeek V3 - currently the strongest open-source LLM, even surpassing GPT-4o.

    The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one. SGLang had already supported MLA optimizations and DP attention, making it one of the best open-source LLM engines for running DeepSeek models.

    Special thanks to Meituan's Search & Recommend Platform Team @ispobock @HandH1998 and Baseten's Model Performance Team @zhyncs for implementing the model, and DataCrunch for providing GPU resources.

  • Various improvements to the cache-aware sglang router, torchao integration, and server termination.

  • Added a standalone package, sgl-kernel, to support more custom kernels in the codebase.


Release v0.4.0

04 Dec 02:14
f8b0326

Highlights

Blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/

We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:

  • Zero-overhead batch scheduler: 1.1x increase in throughput.
  • Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
  • Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement (a launch sketch follows this list).
  • Fast structured outputs with xgrammar: up to 10x faster.
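
A launch sketch for the data-parallelism attention item above. The flag names and model path are assumptions, not taken from these notes.

    # Hypothetical sketch: enabling data-parallelism attention for a
    # DeepSeek model. Flag names and model path are assumptions.
    import subprocess

    subprocess.run([
        "python", "-m", "sglang.launch_server",
        "--model-path", "deepseek-ai/DeepSeek-V2.5",  # assumed model
        "--enable-dp-attention",                      # assumed flag
        "--dp", "2",
        "--tp", "2",
    ])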
