Conversation

saienduri
Collaborator

Motivation

Modifications

Checklist

@merrymercy added the ready-to-merge label (the PR is ready to merge after the CI is green) Apr 27, 2025
@merrymercy merged commit c5e1026 into main Apr 27, 2025
5 checks passed
@merrymercy deleted the docker-update branch April 27, 2025 01:46
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request May 16, 2025
* fix: update pr-test-sgl-kernel (sgl-project#5399)

* kernel: support slightly faster merge_state_v2 cuda kernel (sgl-project#5381)

* chore: bump sgl-kernel 0.0.9 (sgl-project#5400)

* chore: upgrade sgl-kernel 0.0.9 (sgl-project#5401)

* Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (sgl-project#5406)

* Fix bench_serving with random-ids (sgl-project#5214)

* [misc] fix ci flaky case (sgl-project#5352)

* [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (sgl-project#5412)

* Support dynamic connection and TP 16 (sgl-project#5351)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* Fix broadcast use cuda device lead to memory capacity unbalanced (sgl-project#5416)

* [PD] Fix dynamic port support and MLA buffer for Mooncake (sgl-project#5415)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Co-authored-by: ybyang <ybyang7@iflytek.com>

* Distinguish bootstrap key only in decode server (sgl-project#5422)

* [PD] Remove unused bootstrap param and fix port table type (sgl-project#5423)

* [minor] cleanup cmakelists.txt (sgl-project#5420)

* bugfix: fix merge_state_v2 cuda graph (sgl-project#5419)

* chore: bump sgl-kernel v0.0.9.post1 (sgl-project#5430)

* fix: solve release issue (sgl-project#5434)

* Blackwell cutlass mla: Add check for bad page size/block num combinations (sgl-project#5431)

* feat: update model_specific_adjustment (sgl-project#5344)

Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>

* chore: upgrade sgl-kernel 0.0.9.post1 (sgl-project#5436)

* Fix ignore_eos parameter when loading a chat template (sgl-project#5264)

* add attention backend supporting matrix in the doc (sgl-project#5211)

Co-authored-by: Stefan He <hebiaobuaa@gmail.com>

* Support BNB quantization for llama/mllama (sgl-project#5038)

Co-authored-by: Yuhao Yang <yyh073@foxmail.com>

* [Docs] Update start/install.md (sgl-project#5398)

* [Minor] Move torch.compile patch to a better place (sgl-project#5397)

* [Bug fix] need record start time in pd mode (sgl-project#5425)

* Support MHA with chunked prefix cache for DeepSeek chunked prefill (sgl-project#5113)

* chore: bump v0.4.5.post1 (sgl-project#5445)

* Fix several minor issues in PD disaggregation (sgl-project#5444)

* [doc] Update benchmark_and_profiling.md (sgl-project#5449)

* Update cutlass dependency. (sgl-project#5447)

* add multi-lora feature in README.md (sgl-project#5463)

* Clean up imports (sgl-project#5467)

* [verl] Modify the update_weights func to align with verl's resharding (sgl-project#5345)

Co-authored-by: Chayenne <zhaochen20@outlook.com>

* [Model Support] unsloth/Phi-4-mini bnb model (sgl-project#4982)

Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>

* Update attention_backend.md: plural form (sgl-project#5489)

* Add test for flash_attn_varlen_func kernel (sgl-project#5484)

* Deprecate disable-mla (sgl-project#5481)

* Deprecate enable-flashinfer-mla and enable-flashmla (sgl-project#5480)

* Feat/support encoder model (like bert) (sgl-project#4887)

* Enable local attention during decode (sgl-project#5479)

* Refactor DeepSeek decoder layer branches (sgl-project#5205)

* Fix a link in sgl-kernel/README.md (sgl-project#5493)

* [Bug fix] use correct func path in deepseek (sgl-project#5496)

Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>

* Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the instability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B (sgl-project#5503)

* [Feat] Update sgl-kernel flashinfer to latest main version (sgl-project#5500)

Co-authored-by: zhyncs <me@zhyncs.com>

* Fix: Incorrect parameters passed to forward_batch_generation (sgl-project#5506) (sgl-project#5511)

* Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … (sgl-project#5426)

Co-authored-by: ocss884 <ocss.lin@gmail.com>

* [docs] Fix several consistency issues in sampling_params.md (sgl-project#5373)

Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* Configuration qwen2_moe.py - qkv_bias now in transformers (sgl-project#5512)

* Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (sgl-project#4836)

* Sgl kernel fused_moe_gate support n_shared_experts (sgl-project#5440)

* chore: bump sgl-kernel 0.0.9.post2 (sgl-project#5518)

* use sglang_per_token_group_quant_fp8 from sgl-kernel instead of triton kernel (sgl-project#5473)

Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>

* fix kimi vl running bug after rebase main (sgl-project#5461)

* fix bug of VLLM_AVAILABLE not defined (sgl-project#5497)

* Avoid computing lse in Ragged Prefill when there's no prefix. (sgl-project#5476)

Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>

* [Model] Adding Qwen3 and Qwen3MoE (sgl-project#4693)

* fix util import (sgl-project#5542)

* Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (sgl-project#5544)

* chore: upgrade sgl-kernel 0.0.9.post2 (sgl-project#5540)

* Fix DeepGEMM masked cannot be run on groups not being multiple of 4 (sgl-project#5340)

* Make profiler output file names consistent (sgl-project#5548)

* [PD] Tiny fix timeout error when generate (sgl-project#5545)

* [PD] Fix no cache connect for receiver (sgl-project#5534)

* feat: use flashinfer jit package (sgl-project#5547)

* [PD] Remove the requirement of config file for mooncake backend (sgl-project#5460)

* restruct compressed_tensors_w8a8_fp8 (sgl-project#5475)

* simplify the control logic for using shared experts fusion (sgl-project#5504)

* Remove one kernel in per_tensor_quant_mla_fp8 (sgl-project#5549)

* Fix sampler nan check when calling top_k_top_p_sampling_from_probs (sgl-project#5546)

* [PD] Support page size > 1 (sgl-project#5561)

* fix hicache write back (sgl-project#5543)

* Minor update for ROCm variable style (sgl-project#5562)

* Fix bench_one_batch producing unnatural results for expert parallel (sgl-project#5149)

* [perf] introduce deep gemm group_gemm_masked as bmm (sgl-project#5432)

* [PD] Fix DeepSeek cannot be run on latest master (sgl-project#5568)

* Fix BumpAllocator error when no input_ids (sgl-project#5564)

* enable DeepSeek V3 shared_experts_fusion in sm90 (sgl-project#5571)

* [Fix] fix outlines and xgrammar (sgl-project#4947)

* [Doc]Add instruction for profiling with bench_one_batch (sgl-project#5581)

* Release v0.4.5.post2 (sgl-project#5582)

* Fix bench_serving fail when zero warmup requests (sgl-project#5574)

* Fix DeepEP cannot run on latest master (sgl-project#5567)

* Fix torch memory saver not enabled in DP scenario (sgl-project#5560)

* Super tiny fix typo (sgl-project#5559)

* Add document for LoRA serving (sgl-project#5521)

* Tiny improve error message (sgl-project#5526)

* [PD] Fix server crash when using batch requests (sgl-project#5531)

* [Feat] upgrade pytorch2.6 (sgl-project#5417)

* Fix enable chunked prefill for Llama4 (sgl-project#5575)

* fix: use fa3 for gemma2 (sgl-project#5586)

* Fix ChatCompletionMessageGenericParam to allow for None content (sgl-project#5452)

* [PD] Fix large page size + chunk prefill (sgl-project#5588)

* Add test config yamls for Deepseek v3 (sgl-project#5433)

* [Feature] Prefill assistant response - add continue_final_message parameter (sgl-project#4226)

Co-authored-by: Chayenne <zhaochen20@outlook.com>

* add function call parser for DeepSeek V3 (sgl-project#5224)

* smaller and non gated models for docs (sgl-project#5378)

* Feat: Implement JSON Mode (response_format.type="json_object") (sgl-project#4733)

Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net>

* check marlin format before attempting conversion (sgl-project#4675)

* compressed_tensors: port w8a16 fp8 from vllm (sgl-project#4852)

* Fix one more issue reported by torchfix (sgl-project#4859)

* Add sanity check for max_running_requests (sgl-project#5016)

* Correct grafana heatmap. (sgl-project#5019)

* Perform Batch Tokenization. (sgl-project#5141)

* Speedup shared expert weight construction by avoid cloning (sgl-project#5188)

* Tiny add Engine.flush_cache API (sgl-project#5241)

* [misc] remove is_cuda_available (sgl-project#5319)

* Fix flush cache (sgl-project#5590)

* Add Speculative Decoding Eagle3 topk > 1 (sgl-project#5318)

Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>

* upstream hicache fixes (sgl-project#5570)

* Tiny add warning when cannot recognize bool env var (sgl-project#5348)

* Modify metrics service endpoint (sgl-project#3443)

* Update protocol.py to fix sgl-project#4589 (sgl-project#4590)

* [Feat.] Enable grafana to show metrics (sgl-project#4718)

Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

* [Fix] Enhance DP Attention for IPv6 Compatibility (sgl-project#4937)

* Support o1 model on Azure (sgl-project#4980)

Co-authored-by: Shan Yu <shanyu1@g.ucla.edu>

* Tiny remove duplicated code (sgl-project#5021)

* Tiny update error hint (sgl-project#5037)

* Support PD bootstrap fields on /v1/chat/completions endpoint (sgl-project#5488)

* [PD] Fix generate endpoint of min_lb for PD (sgl-project#5598)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [PD] Fix edge case and simplify large page size + chunked prefill (sgl-project#5589)

* [PD] Add NIXL transfer backend (sgl-project#5477)

* [PD] Support decode overlap schedule (sgl-project#5608)

* [PD] Support prefill overlap + Ensure no race condition (sgl-project#5609)

* Enhance GPU memory settings (sgl-project#5604)

* [feature] enable pre compile jit deep_gemm (sgl-project#5580)

* Clean up mem settings (sgl-project#5610)

* Support aiter RMSNorm in AMD (sgl-project#5510)

Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>

* chore: bump v0.4.5.post3 (sgl-project#5611)

* Remove extra copy in deepseek forward absorb (sgl-project#5578)

Co-authored-by: saienduri <saimanas.enduri@amd.com>

* [Doc] Fix a 404 link to llama-405b (sgl-project#5615)

Signed-off-by: windsonsea <haifeng.yao@daocloud.io>

* [fix] force use deepgemm in compile_deep_gemm (sgl-project#5618)

* [fix] fix compile_deep_gemm missing kv_b_proj (sgl-project#5620)

* fix: gemma 3 not use softcap (sgl-project#5622)

* Fix FA3 DeepSeek prefill performance regression (sgl-project#5624)

Co-authored-by: ispobock <ispobaoke@gmail.com>

* [NFC] Remove duplicate `compressed-tensors` (sgl-project#5640)

* Fix shared experts fusion error without quantization (sgl-project#5632)

* [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 (sgl-project#5641)

Co-authored-by: yuethe <yuethe@tencent.com>

* fix flashmla bug (sgl-project#5272)

* [fix] reduce dp capture bs (sgl-project#5634)

Co-authored-by: alcanerian <alcanerian@gmail.com>

* Remove q concat in FA3 backend for DeepSeek decode (sgl-project#5638)

* Revert "Support aiter RMSNorm in AMD" (sgl-project#5646)

* fix: update bench_speculative (sgl-project#5649)

* Turn on DeepGemm By Default and Update Doc (sgl-project#5628)

* Fuse q_a_proj and kv_a_proj (sgl-project#5619)

* Remove unnecessary `torch.full` in DeepSeek (sgl-project#5601)

* [1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell (sgl-project#5281)

* fix sgl-kernel unit tests (sgl-project#5666)

* fix awq_dequantize import (sgl-project#5669)

* Integrating PD disaggregation with DP attention and DeepEP (sgl-project#5435)

Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>

* fix gemma3 unit test (sgl-project#5670)

* fix torchvision::nms not exist (sgl-project#5671)

* [PD] Add support for dp attention with mooncake (sgl-project#5530)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py (sgl-project#5677)

* [Doc] Fix two 404 links caused by sglang typo (sgl-project#5667)

Signed-off-by: windsonsea <haifeng.yao@daocloud.io>

* fix: update truss bench_serving (sgl-project#5683)

* fix: only compile ApplyTokenBitmaskInplace cu124+ (sgl-project#5686)

* chore: bump sgl-kernel 0.1.0 (sgl-project#5688)

* vlm: enable radix cache for qwen-vl models (sgl-project#5349)

Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>

* [BugFix] Fix combination of MTP and `--n-share-experts-fusion` with R1 (sgl-project#5707)

* Fix weight loading bug for Deepseek v3+nextn (sgl-project#5684)

* Add example to use sgl engine with fastapi (sgl-project#5648)

Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>

* [Doc] Fix a link to Weilin Zhao (sgl-project#5706)

Signed-off-by: windsonsea <haifeng.yao@daocloud.io>

* Add MMMU benchmark results (sgl-project#4491)

Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>

* [Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct) (sgl-project#5078)

Co-authored-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>

* [PD] Better logs (sgl-project#5715)

* [PD] Add kvargs table and thread pool for kvcache sender of mooncake (sgl-project#5738)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [PD]: Support Multi Prefill in one node (sgl-project#5704)

Co-authored-by: shuaills <shishuaiuoe@gmail.com>

* Fix: deepseek forward absorb (sgl-project#5723)

Co-authored-by: ispobock <ispobaoke@163.com>

* Pin torch audio to 2.6.0 (sgl-project#5750)

* Revert "[Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct)" (sgl-project#5754)

* Disable flaky eagle tests (sgl-project#5753)

* update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. (sgl-project#5740)

* [Docs] Update runtime/engine/readme.md (sgl-project#5737)

Signed-off-by: windsonsea <haifeng.yao@daocloud.io>

* Reorder loop in shared expert weight loading (sgl-project#5719)

* fix: fix one more bug from merging mm_inputs (sgl-project#5718)

Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>

* [Fix]: support deepseek-vl2-tiny model (sgl-project#5552)

Co-authored-by: bppps <zouyu.zzx@alibaba-inc.com>

* Bugfix for minicpmo vision test (sgl-project#5760)

* [Minor] fix documentations (sgl-project#5756)

* Add an assertion to enhance the robustness of the operator (sgl-project#5736)

* fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 (sgl-project#5733)

* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)

* [fix] fix potential bumpy throughput with deepgemm (sgl-project#5722)

* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)

* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)

* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)

* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)

* Fuse MLA set kv cache kernel (sgl-project#5748)

* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)

* [feature] support for roberta embedding models (sgl-project#5730)

* [fix] fix bench_one_batch_server (sgl-project#5607)

* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)

* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)

* Add Llama 4 to FA3 test (sgl-project#5509)

* [misc] more decode step log for batch_one_batch (sgl-project#5565)

* Handle JSONDecodeError while processing request data (sgl-project#5599)

* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)

* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)

* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)

* Add memory_saver check (sgl-project#4986)

Signed-off-by: Kebe <mail@kebe7jun.com>

* add switch to disable open api doc (sgl-project#3744)

Signed-off-by: congcongke <zhanweidu@163.com>

* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)

* Fix eagle test case (sgl-project#5776)

* Split local attention test from fa3 test (sgl-project#5774)

* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)

* Simplify FA3 tests (sgl-project#5779)

* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)

* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)

* [CI] Tune threshold (sgl-project#5787)

* [CI] fix port conflicts (sgl-project#5789)

* [CI] Fix ci tests (sgl-project#5769)

* [PD]Reduce kv transfer threads (sgl-project#5791)

* [CI] Fix test case (sgl-project#5790)

* Add 8-GPU Test for Deepseek-V3 (sgl-project#5691)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* Release v0.4.6 (sgl-project#5795)

* Update nightly-test.yml (sgl-project#5797)

* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)

* [Docs] update grafana setup guide in production metrics (sgl-project#5643)

Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>

* [Misc] add structure logging, write to file and log tracing for SGL Router

* Improve overlap scheduling (sgl-project#5788)

* Add Cutlass MLA attention backend (sgl-project#5390)

* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)

* Dockerfile.dev pip scikit_build_core (sgl-project#5807)

* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)

* Turn on overlap scheduler for multimodal models (sgl-project#5771)

* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)

* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)

* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)

* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551)

Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>

* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)

* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)

* fused moe triton tuning script support qwen3 (sgl-project#5842)

* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)

* [PD] support pd fake transfer for warmup (sgl-project#5726)

* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)

* [Doc] Recover history of server_arguments.md (sgl-project#5851)

* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)

* [CI] test chunked prefill more (sgl-project#5798)

* ROCm: update AITER (sgl-project#5816)

* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847)

Co-authored-by: sighingnow <sighingnow@gmail.com>

* [Fix] Missing bootstrap_port field (sgl-project#5823)

* feat: update is_fa3_default_architecture (sgl-project#5854)

* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)

* chore: bump v0.4.6.post1 (sgl-project#5845)

* fix for hpu backend in model runner and server args

Signed-off-by: Mohit Sinha <msinha@habana.ai>

* rebase formatting issue

Signed-off-by: Mohit Sinha <msinha@habana.ai>

* [SW-228218]: Fix device mismatch in frequency penalty.

Ensure tensors in BatchedFrequencyPenalizer are on the same device by
moving output_ids and frequency_penalties to the device of
cumulated_frequency_penalties. This resolves a RuntimeError
caused by tensors on cpu and hpu:0 during logits subtraction.

---------

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: congcongke <zhanweidu@163.com>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Zhaoyang Hao <77828610+Muuuchen@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: lambert0312 <lambert80.ios@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: mRSun15 <3150105645@zju.edu.cn>
Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com>
Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: BearBiscuit <55008898+BearBiscuit05@users.noreply.github.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Didier Durand <durand.didier@gmail.com>
Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com>
Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com>
Co-authored-by: PGFLMG <1106310035@qq.com>
Co-authored-by: u4lr451 <u4lr451@gmail.com>
Co-authored-by: ocss884 <ocss.lin@gmail.com>
Co-authored-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com>
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Co-authored-by: liwenju0 <like4hub@gmail.com>
Co-authored-by: Wenxuan Tan <wtan45@wisc.edu>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com>
Co-authored-by: AmadeusW <41280211+Amadeus-Winarto@users.noreply.github.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: Yi Zhou <zhouyi920521@gmail.com>
Co-authored-by: simveit <69345428+simveit@users.noreply.github.com>
Co-authored-by: kyle-pena-kuzco <kyle.pena@kuzco.xyz>
Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net>
Co-authored-by: Enrique Shockwave <33002121+qeternity@users.noreply.github.com>
Co-authored-by: Juwan Yoo <ryan@tmfi.us>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: mac0ne <mac0ne@users.noreply.github.com>
Co-authored-by: Sundara Raman Ramachandran <sundar24295@gmail.com>
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: moontidef <53668275+relic-yuexi@users.noreply.github.com>
Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com>
Co-authored-by: Lucius <souzou@foxmail.com>
Co-authored-by: Chuyue Sun <33578456+ChuyueSun@users.noreply.github.com>
Co-authored-by: Shan Yu <shanyu1@g.ucla.edu>
Co-authored-by: Yongtong Wu <914554688@qq.com>
Co-authored-by: michael-amd <Michael.Zhang@amd.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Connector Switch <c8ef@outlook.com>
Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com>
Co-authored-by: yuethe <yuethe@tencent.com>
Co-authored-by: alcanerian <alcanerian@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Ravi Theja <ravi03071991@gmail.com>
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
Co-authored-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
Co-authored-by: IAN <50618241+hcyz33@users.noreply.github.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: ZXN <44322223+bppps@users.noreply.github.com>
Co-authored-by: bppps <zouyu.zzx@alibaba-inc.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: vzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: DavidBao <121073073+DavidBao03@users.noreply.github.com>
Co-authored-by: Frankey_8080 <32973306+Frank-Jie@users.noreply.github.com>
Co-authored-by: yan97ao <580776+yan97ao@users.noreply.github.com>
Co-authored-by: aoshen524 <aoshen524@gmail.com>
Co-authored-by: Michał Moskal <michal@moskal.me>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: zhanweidu <zhanweidu@163.com>
Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
Co-authored-by: JiLi <leege233@gmail.com>
Co-authored-by: sighingnow <sighingnow@gmail.com>
Co-authored-by: XTY <xutianyi1999@live.com>
Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request May 23, 2025
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)

* [fix] fix potential bumpy throughput with deepgemm (sgl-project#5722)

* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)

* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)

* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)

* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)

* Fuse MLA set kv cache kernel (sgl-project#5748)

* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)

* [feature] support for roberta embedding models (sgl-project#5730)

* [fix] fix bench_one_batch_server (sgl-project#5607)

* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)

* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)

* Add Llama 4 to FA3 test (sgl-project#5509)

* [misc] more decode step log for batch_one_batch (sgl-project#5565)

* Handle JSONDecodeError while processing request data (sgl-project#5599)

* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)

* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)

* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)

* Add memory_saver check (sgl-project#4986)

Signed-off-by: Kebe <mail@kebe7jun.com>

* add switch to disable open api doc (sgl-project#3744)

Signed-off-by: congcongke <zhanweidu@163.com>

* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)

* Fix eagle test case (sgl-project#5776)

* Split local attention test from fa3 test (sgl-project#5774)

* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)

* Simplify FA3 tests (sgl-project#5779)

* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)

* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)

* [CI] Tune threshold (sgl-project#5787)

* [CI] fix port conflicts (sgl-project#5789)

* [CI] Fix ci tests (sgl-project#5769)

* [PD]Reduce kv transfer threads (sgl-project#5791)

* [CI] Fix test case (sgl-project#5790)

* Add 8-GPU Test for Deepseek-V3 (sgl-project#5691)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* Release v0.4.6 (sgl-project#5795)

* Update nightly-test.yml (sgl-project#5797)

* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)

* [Docs] update grafana setup guide in production metrics (sgl-project#5643)

Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>

* [Misc] add structure logging, write to file and log tracing for SGL Router

* Improve overlap scheduling (sgl-project#5788)

* Add Cutlass MLA attention backend (sgl-project#5390)

* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)

* Dockerfile.dev pip scikit_build_core (sgl-project#5807)

* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)

* Turn on overlap scheduler for multimodal models (sgl-project#5771)

* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)

* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)

* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)

* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551)

Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>

* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)

* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)

* fused moe triton tuning script support qwen3 (sgl-project#5842)

* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)

* [PD] support pd fake transfer for warmup (sgl-project#5726)

* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)

* [Doc] Recover history of server_arguments.md (sgl-project#5851)

* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)

* [CI] test chunked prefill more (sgl-project#5798)

* ROCm: update AITER (sgl-project#5816)

* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847)

Co-authored-by: sighingnow <sighingnow@gmail.com>

* [Fix] Missing bootstrap_port field (sgl-project#5823)

* feat: update is_fa3_default_architecture (sgl-project#5854)

* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)

* chore: bump v0.4.6.post1 (sgl-project#5845)

* Support `max_completion_tokens` for OpenAIChatCompletions (sgl-project#5857)

* simplify fused_moe config logging (sgl-project#5801)

* [CI] tune the test order to warmup the server (sgl-project#5860)

* Cutlass MLA decode - fix dtype error (sgl-project#5868)

* cutlass 3.9 supported to improve fp8_blockwise_gemm (sgl-project#5820)

* [Feature] support auto chat template (sgl-project#4949)

* Feat: support cuda graph for LoRA (sgl-project#4115)

Co-authored-by: Beichen Ma <mabeichen12@gmail.com>

* Add qwen3 30b fused moe config (sgl-project#5859)

* [Fix] Fix a bug for flashmla to run R1 model (sgl-project#5875)

Co-authored-by: pengcuo <dgpengcuo@gmail.com>

* Add A800 fused moe config for qwen3 30b (sgl-project#5880)

* [Misc] add service discovery for sgl router

* [fix]: PyO3 macOS linking and consolidate on tracing for logging

* chore: update Dockerfile (sgl-project#5894)

* [Docs] Update docs for Qwen3 and Qwen3MoE (sgl-project#5836)

* [Doc] Tables instead of bulletpoints for sampling doc (sgl-project#5841)

* chore: update CODEOWNERS (sgl-project#5895)

* [FEATURE] Enhance platform compatibility for ARM (sgl-project#5746)

* [CI] Add test_function_calling.py to run_suite.py (sgl-project#5896)

* Auto set draft model path for MTP (sgl-project#5793)

* [fix] relax mem_fraction_static for h200 (sgl-project#5893)

Co-authored-by: alcanerian <alcanerian@gmail.com>

* feat: support pythonic tool call and index in tool call streaming (sgl-project#5725)

* [Bugfix]: fix missing queue_time_start for requests from grammar_queue (sgl-project#5696)

* Add AMD MI300x Nightly Testing. (sgl-project#5861)

* chore: use torch 2.6 for sgl-kernel build (sgl-project#5898)

* Fix check_env script (sgl-project#5901)

* [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels sgl-project#134 (sgl-project#5830)

* Bump Flashinfer to 0.2.5 (sgl-project#5870)

Co-authored-by: Yuhao Chen <yxckeis8@gmail.com>

* [Fix] Unload lora in HF_Runner if needed (sgl-project#5899)

* Add A800 fused moe config for qwen3 235b (sgl-project#5900)

* Add sm_120 for blackwell (sgl-project#5903)

* [Feature] add support kimi vl model (sgl-project#5383)

Co-authored-by: wenju.li <wenju.li@deepctr.cn>

* support vlm benchmark profile (sgl-project#5905)

* [fix] kimi-vl test in test_vision_openai_server.py (sgl-project#5910)

* [Misc] use parallel build for cmake in sgl-kernel (sgl-project#5919)

* [qwen3] support qwen3 ep moe (sgl-project#5917)

Co-authored-by: sleepcoo <sleepcoo@gmail.com>

* Add TP2 MOE benchmarks for AMD. (sgl-project#5909)

* [Feat] Scale up fa3 kernel to sm8x arch (sgl-project#5912)

Co-authored-by: zhyncs <me@zhyncs.com>

* chore: bump sgl-kernel 0.1.1 (sgl-project#5932)

* chore: upgrade sgl-kernel 0.1.1 (sgl-project#5933)

* Remove unused method `calculate_num_image_tokens` from qwen2_vl.py (sgl-project#5783)

* [PP] Add pipeline parallelism (sgl-project#5724)

* Fix lora batch processing when input lora_path contains None (sgl-project#5930)

* add Thor & Spark (sgl-project#5915)

* fix: correct stream response when enable_thinking is set to false (sgl-project#5881)

* fix: update model runner (sgl-project#5934)

* chore: bump v0.4.6.post2 (sgl-project#5939)

* Support XiaomiMiMo/MiMo model inference (sgl-project#5921)

* [PD] Vectorise group_concurrent_contiguous in NumPy (sgl-project#5834)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* Remove extra contiguous (sgl-project#5953)

* Update ci test and doc for MTP api change (sgl-project#5952)

* docs: Fix Qwen model typo (sgl-project#5944)

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>

* Optimize a pad operation to accelerate 25us (sgl-project#5945)

* Properly return error response in vertex_generate HTTP endpoint (sgl-project#5956)

* feat: add concurrency evaluation logic in mmmu benchmark (sgl-project#5782)

* Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. (sgl-project#5960)

* feat: Refactor DeepSeekV3 function call (sgl-project#5908)

* Remove token in token out in Native API (sgl-project#5967)

* Support InternVL3 (sgl-project#5350)

Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>

* Support MMMU benchmark for InternVL (sgl-project#5968)

* FA3 speed up: skip len operation and get batch size directly from forward batch (sgl-project#5969)

Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>

* [PD] NIXL backend Prefill TP & Decode TP+DP (sgl-project#5681)

* Fix set kv cache multi-stream (sgl-project#5975)

* Overlap qk norm with two streams (sgl-project#5977)

* fix: only upgrade nccl for cu128 (sgl-project#5986)

* Fix Phi3 serving which was broken by earlier change (sgl-project#5991)

Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>

* [perf] H100 DeepSeek-V3 fused moe tuned config (sgl-project#5998)

* [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (sgl-project#5992)

* [Minor] Fix duplicate method definitions in conversation.py (sgl-project#6012)

Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>

* Fix flaky issues of lora and add multi batch tests (sgl-project#5957)

* Tool Call: Add `chat_template_kwargs` documentation (sgl-project#5679)

* fix: fix broadcast_pyobj breaking VerlEngine (sgl-project#5997)

* [PD] Allow customizing reserved tokens to avoid KV cache waste (sgl-project#6002)

* Update dev container config to support live code sync and improve docker setup guide (sgl-project#6018)

Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>

* [PD] Optimize disaggregation ib device help info (sgl-project#5781)

* [Test] Add flashmla attention backend test (sgl-project#5587)

* Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (sgl-project#5555)

* feat: Add a unified merge_state API (sgl-project#5428)

* feat: append more comprehensive fields in messages instead of merely role and content (sgl-project#5996)

* [Security][Bug] Prevent binding to all TCP interfaces (sgl-project#5752)

* Fix prefill OOM error in the case of large page size (sgl-project#5081)

* Fix problem of large page size with chunked prefill (sgl-project#6046)

* docs: add Google Cloud Vertex AI in Adoption and Sponsorship (sgl-project#6047)

* docs: add new blog (sgl-project#6048)

* Fix not "import os" (sgl-project#6057)

* Better PD initialization (sgl-project#5751)

* fix: deepep dockerfile, use pip install deepep. (sgl-project#5885)

* [Fix] Fix and rename flashmla CI test (sgl-project#6045)

* chore: upgrade cutlass 3.9.2 (sgl-project#6004)

Co-authored-by: yizhang2077 <1109276519@qq.com>

* Fix sgl-kernel build on aarch64 platforms (sgl-project#6062)

* Add DeepEP to CI PR Test (sgl-project#5655)

Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>

* fix custom_allreduce namespace (sgl-project#6039)

* feat: add release workflow for SGLang kernels on aarch64 (sgl-project#6010)

Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>

* [Feature] Support for Ascend NPU backend (sgl-project#3853)

Signed-off-by: Song Zhang <gepin.zs@antgroup.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>

* Fix the timeout for 8 gpu tests (sgl-project#6084)

* Hint users DeepEP normal mode is incompatible with CUDA Graph (sgl-project#5014)

* Super tiny fix doc (sgl-project#5233)

* [Doc]Fix description for dp_size argument (sgl-project#6063)

* feat(engine): add bootstrap parameters to generate methods (dynamo) (sgl-project#6075)

* [refactor] slightly tidy fp8 module (sgl-project#5993)

* Clean up fa3 test from 8 gpus (sgl-project#6105)

* Deferring 8 GPU test (sgl-project#6102)

* Update doc for MLA attention backends (sgl-project#6034)

* Clean logs for DeepSeek-V3 launching (sgl-project#6079)

* [CI]Add performance CI for VLM (sgl-project#6038)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell (sgl-project#6111)

* optimize pad operations in fa3 to accelerate 100+us (sgl-project#6077)

* Overlap shared expert and routed expert computations (sgl-project#5121)

* Tiny refactor ModelConfig.from_server_args (sgl-project#5219)

* Tiny refactor weight loading logic (sgl-project#5232)

* [PD] Add control to slow down a server (sgl-project#5572)

* Change AMD test threshold (sgl-project#6091)

* DeepEP normal support deepgemm-contiguous (sgl-project#5626)

Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
Co-authored-by: ZhengHSI <zhenghsi@qq.com>

* [fix] fix pyproject.toml dependencies (sgl-project#6119)

* [Feature] Add FlashAttention3 as a backend for VisionAttention (sgl-project#5764)

Co-authored-by: othame <chenzhu_912@zju.edu.cn>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>

* [perf] dsv3 bmm fallback to bf16 (sgl-project#5662)

* [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm (sgl-project#6097)

* [sgl-kernel] fix: fix cu118 compile error (sgl-project#6123)

Co-authored-by: zhyncs <me@zhyncs.com>

* upgrade xgrammar to 0.1.19 (sgl-project#6129)

* Remove unnecessary is_fa3_supported check (sgl-project#6112)

* chore: bump sgl-kernel 0.1.2 (sgl-project#6131)

* docs: update README (sgl-project#6132)

* [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP (sgl-project#5745)

* Cutlass MLA: Disable split kv due to NVIDIA/cutlass#2274 (sgl-project#6101)

* opt flashinfer mla cat (sgl-project#5822)

Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>

* Update amd nightly concurrency. (sgl-project#6141)

* feat: add thinking_budget (sgl-project#6089)

* [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (sgl-project#6162)

* fix bug that gpu0 occupies more memory when hicache is turned on (sgl-project#5778)

Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* chore: bump v0.4.6.post3 (sgl-project#6165)

* KV‑Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (sgl-project#6016)

Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: chus-chus <chus-chus@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* [fix] fix determine_n_share_experts_fusion (sgl-project#6118)

* Fix and Clean up chat-template requirement for VLM (sgl-project#6114)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* [Docs]Delete duplicate content (sgl-project#6146)

Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>

* Revert "feat: add thinking_budget (sgl-project#6089)" (sgl-project#6181)

* Added async_encode method to Engine (sgl-project#4701)

* Fix data parallel perf regression (sgl-project#6183)

* Fix request abortion (sgl-project#6184)

* Add typo checker in pre-commit (sgl-project#6179)

Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* Remove duplicate IO Struct test (sgl-project#6180)

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* [PD] Add simple unit test for disaggregation feature (sgl-project#5654)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [CI] Disabled deepep tests temporarily because it takes too much time. (sgl-project#6186)

* feat: support loogle eval (sgl-project#6190)

* [fix] remove mixtral from is_fa3_default_architecture (sgl-project#6191)

* fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode (sgl-project#6169)

* chore: upgrade deepgemm (sgl-project#6073)

* chore: bump sgl-kernel v0.1.2.post1 (sgl-project#6195)

* chore: upgrade sgl-kernel v0.1.2.post1 (sgl-project#6196)

Co-authored-by: alcanderian <alcanderian@gmail.com>

* Handle empty input string for embedding models (sgl-project#5621)

Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>

* doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (sgl-project#6199)

* [Docs] minor Qwen3 and reasoning parser docs fix (sgl-project#6032)

* Improve structured outputs: fix race condition, server crash, metrics and style (sgl-project#6188)

* [CI] Reorganize the 8 gpu tests (sgl-project#6192)

* Add dev-deepep docker image (sgl-project#6198)

* Replace time.time() to time.perf_counter() for benchmarking. (sgl-project#6178)

Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>

* Update README.md (sgl-project#6202)

* Fix release-docs.yml to not use python 3.9 (sgl-project#6204)

* Fix start_profile does not support with_stack and record_shapes (sgl-project#6043)

* [doc] add a note for --n-share-experts-fusion args (sgl-project#6154)

* Performing Vocabulary Parallelism for LM Head across Attention TP Groups (sgl-project#5558)

Co-authored-by: liusy58 <liusy58@linux.alibaba.com>

* Update AMD CI docker to v0.4.6.post3-rocm630. (sgl-project#6213)

* Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (sgl-project#6201)

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* [CI] Fix PD mooncake dependency error (sgl-project#6212)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [CI] Re-enable pd disaggregation test (sgl-project#6231)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* fix some typos (sgl-project#6209)

Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* [Docs] Add docs for `SGLANG_` and `SGL_` environment variables (sgl-project#6206)

* [PP] Fix init_memory_pool desync & add PP for mixtral (sgl-project#6223)

* Revert "fix some typos" (sgl-project#6244)

* chore: add hf_xet dep (sgl-project#6243)

* Update AMD nightly deps. (sgl-project#6241)

* [PD] Add support for different TP sizes per DP rank (sgl-project#5922)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Support incremental streaming of logprob/token_ids between scheduler and detokenizer (sgl-project#6225)

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* fix typo (sgl-project#6248)

* Support tuning moe for llama 4 model (sgl-project#6042)

* Skip the flaky test_stateful_custom_logit_processor (sgl-project#6251)

* [Llama4] Add docs note about enable multimodal (sgl-project#6235)

* [VERL Use Case] Add torch_memory_saver into deps (sgl-project#6247)

* Fix two issues related to `--moe-dense-tp-size=1` (sgl-project#5657)

Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com>

* model(vlm): pixtral (sgl-project#5084)

* [misc] deep_gemm fallback to NVRTC when NVCC not found (sgl-project#6252)

* Enable MI325X AMD CI. (sgl-project#6259)

* chore: bump v0.4.6.post4 (sgl-project#6245)

* formatting fix for the rebased commit for 4.6.0_post4

Signed-off-by: Mohit Sinha <msinha@habana.ai>

* fix issues in model runner and python packages

fix for following issues:
> vLLM dependency for xgrammar==0.1.17
> 'Scheduler' object has no attribute 'device'
> 'pp_proxy_tensors' unexpected arg in HPUGraphRunner
> TODO: Add pipeline parallelism support in HPUGraphRunner

Signed-off-by: Mohit Sinha <msinha@habana.ai>

* fix formatting in model runner

Signed-off-by: Mohit Sinha <msinha@habana.ai>

* base grammar fix for the is_terminated case

>  'OutlinesGrammar' object has no attribute 'is_terminated'

Signed-off-by: Mohit Sinha <msinha@habana.ai>

---------

Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: congcongke <zhanweidu@163.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
Signed-off-by: Song Zhang <gepin.zs@antgroup.com>
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Wenxuan Tan <wtan45@wisc.edu>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com>
Co-authored-by: vzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: DavidBao <121073073+DavidBao03@users.noreply.github.com>
Co-authored-by: Frankey_8080 <32973306+Frank-Jie@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: yan97ao <580776+yan97ao@users.noreply.github.com>
Co-authored-by: aoshen524 <aoshen524@gmail.com>
Co-authored-by: Michał Moskal <michal@moskal.me>
Co-authored-by: lambert0312 <lambert80.ios@gmail.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: zhanweidu <zhanweidu@163.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com>
Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: JiLi <leege233@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: PGFLMG <1106310035@qq.com>
Co-authored-by: sighingnow <sighingnow@gmail.com>
Co-authored-by: XTY <xutianyi1999@live.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <qy254@cornell.edu>
Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
Co-authored-by: pengcuo <pengcbupt@163.com>
Co-authored-by: pengcuo <dgpengcuo@gmail.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: simveit <69345428+simveit@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>
Co-authored-by: alcanerian <alcanerian@gmail.com>
Co-authored-by: Yuhao Chen <yxckeis8@gmail.com>
Co-authored-by: zhjunqin <zhjunqin@users.noreply.github.com>
Co-authored-by: liwenju0 <like4hub@gmail.com>
Co-authored-by: wenju.li <wenju.li@deepctr.cn>
Co-authored-by: laixin <xielx@shanghaitech.edu.cn>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com>
Co-authored-by: KCFindstr <shimakaze@google.com>
Co-authored-by: xm:D <38322020+xiaomin-D@users.noreply.github.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: Yongtong Wu <914554688@qq.com>
Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: Hank Han <54751605+HanHan009527@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
Co-authored-by: Song Zhang <70674731+botieking98@users.noreply.github.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
Co-authored-by: ZhengHSI <zhenghsi@qq.com>
Co-authored-by: Zhu Chen <51010608+Othame@users.noreply.github.com>
Co-authored-by: othame <chenzhu_912@zju.edu.cn>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: xu-yfei <xu_yfei@qq.com>
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
Co-authored-by: thyecust <tienhoayu@gmail.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: Simon (Jiyou) Li <Simon-Li@users.noreply.github.com>
Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: chus-chus <chus-chus@users.noreply.github.com>
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com>
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
Co-authored-by: Steven Shimizu <shimizust@gmail.com>
Co-authored-by: applesaucethebun <113181361+applesaucethebun@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Yusong Gao <yusong.gao@gmail.com>
Co-authored-by: alcanderian <alcanderian@gmail.com>
Co-authored-by: Ravi Theja <ravi03071991@gmail.com>
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com>
Co-authored-by: Kiv Chen <34561254+KivenChen@users.noreply.github.com>