Skip to content

Conversation

fzyzcjy
Copy link
Collaborator

@fzyzcjy fzyzcjy commented Apr 7, 2025

Motivation

Before:

  • bs1: 120.30
  • bs2: 205.36

After:

  • bs1: 135.62
  • bs2: 223.26

Commands:

python3 -m sglang.launch_server --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --port 20000 --tp 8 --mem-fraction-static 0.8 --context-length 8192 --disable-radix-cache
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 50 --random-input 1000 --random-output 1000 --random-range-ratio 1.0 --max-concurrency 1 --port 20000
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 50 --random-input 1000 --random-output 1000 --random-range-ratio 1.0 --max-concurrency 2 --port 20000

Modifications

Checklist

@fzyzcjy fzyzcjy marked this pull request as draft April 7, 2025 09:15
@fzyzcjy fzyzcjy marked this pull request as ready for review April 7, 2025 09:39
fzyzcjy added 3 commits April 9, 2025 17:16
@merrymercy merrymercy added the ready-to-merge The PR is ready to merge after the CI is green. label Apr 27, 2025
@zhyncs zhyncs merged commit 3b2680a into sgl-project:main May 8, 2025
0 of 24 checks passed
RunkaiTao pushed a commit to RunkaiTao/sglang that referenced this pull request May 9, 2025
lifuhuang pushed a commit to lifuhuang/sglang that referenced this pull request May 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request May 23, 2025
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)

* [fix] fix potential bumpy throughtput with deepgemm (sgl-project#5722)

* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)

* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)

* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)

* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)

* Fuse MLA set kv cache kernel (sgl-project#5748)

* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)

* [feature] support for roberta embedding models (sgl-project#5730)

* [fix] fix bench_one_batch_server (sgl-project#5607)

* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)

* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)

* Add Llama 4 to FA3 test (sgl-project#5509)

* [misc] more decode step log for batch_one_batch (sgl-project#5565)

* Handle JSONDecodeError while processing request data (sgl-project#5599)

* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)

* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)

* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)

* Add memory_saver check (sgl-project#4986)

Signed-off-by: Kebe <mail@kebe7jun.com>

* add switch to disable open api doc (sgl-project#3744)

Signed-off-by: congcongke <zhanweidu@163.com>

* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)

* Fix eagle test case (sgl-project#5776)

* Split local attention test from fa3 test (sgl-project#5774)

* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)

* Simplify FA3 tests (sgl-project#5779)

* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)

* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)

* [CI] Tune threshold (sgl-project#5787)

* [CI] fix port conflicts (sgl-project#5789)

* [CI] Fix ci tests (sgl-project#5769)

* [PD]Reduce kv transfer threads (sgl-project#5791)

* [CI] Fix test case (sgl-project#5790)

* Add 8-GPU Test for Deepseek-V3  (sgl-project#5691)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

* Release v0.4.6 (sgl-project#5795)

* Update nightly-test.yml (sgl-project#5797)

* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)

* [Docs] update grafana setup guide in production metrics (sgl-project#5643)

Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>

* [Misc] add structure logging, write to file and log tracing for SGL Router

* Improve overlap scheduling (sgl-project#5788)

* Add Cutlass MLA attention backend (sgl-project#5390)

* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)

* Dockerfile.dev pip scikit_build_core (sgl-project#5807)

* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)

* Turn on overlap scheduler for multimodal models (sgl-project#5771)

* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)

* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)

* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)

* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551)

Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>

* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)

* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)

* fused moe triton tuning script support qwen3 (sgl-project#5842)

* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)

* [PD] support pd fake transfer for warmup (sgl-project#5726)

* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)

* [Doc] Recover history of server_arguments.md (sgl-project#5851)

* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)

* [CI] test chunked prefill more (sgl-project#5798)

* ROCm: update AITER (sgl-project#5816)

* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847)

Co-authored-by: sighingnow <sighingnow@gmail.com>

* [Fix] Missing bootstrap_port field (sgl-project#5823)

* feat: update is_fa3_default_architecture (sgl-project#5854)

* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)

* chore: bump v0.4.6.post1 (sgl-project#5845)

* Support `max_completion_tokens` for OpenAIChatCompletions (sgl-project#5857)

* simplify fused_moe config logging (sgl-project#5801)

* [CI] tune the test order to warmup the server (sgl-project#5860)

* Cutlass MLA decode - fix dtype error (sgl-project#5868)

* cutlass 3.9 supported to improve fp8_blockwise_gemm (sgl-project#5820)

* [Feature] support auto chat template (sgl-project#4949)

* Feat: support cuda graph for LoRA (sgl-project#4115)

Co-authored-by: Beichen Ma <mabeichen12@gmail.com>

* Add qwen3 30b fused moe config (sgl-project#5859)

* [Fix] Fix a bug for flashmla to run R1 model (sgl-project#5875)

Co-authored-by: pengcuo <dgpengcuo@gmail.com>

* Add A800 fused moe config for qwen3 30b (sgl-project#5880)

* [Misc] add service discovery for sgl router

* [fix]: PyO3 macOS linking and consolidate on tracing for logging

* chore: update Dockerfile (sgl-project#5894)

* [Docs] Update docs for Qwen3 and Qwen3MoE (sgl-project#5836)

* [Doc] Tables instead of bulletpoints for sampling doc (sgl-project#5841)

* chore: update CODEOWNERS (sgl-project#5895)

* [FEATURE] Enhance platform compatibility for ARM (sgl-project#5746)

* [CI] Add test_function_calling.py to run_suite.py (sgl-project#5896)

* Auto set draft model path for MTP (sgl-project#5793)

* [fix] relax mem_fraction_static for h200 (sgl-project#5893)

Co-authored-by: alcanerian <alcanerian@gmail.com>

* feat: support pythonic tool call and index in tool call streaming (sgl-project#5725)

* [Bugfix]: fix missing queue_time_start for requests from grammar_queue (sgl-project#5696)

* Add AMD MI300x Nightly Testing. (sgl-project#5861)

* chore: use torch 2.6 for sgl-kernel build (sgl-project#5898)

* Fix check_env script (sgl-project#5901)

* [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels sgl-project#134 (sgl-project#5830)

* Bump Flashinfer to 0.2.5 (sgl-project#5870)

Co-authored-by: Yuhao Chen <yxckeis8@gmail.com>

* [Fix] Unload lora in HF_Runner if needed (sgl-project#5899)

* Add A800 fused moe config for qwen3 235b (sgl-project#5900)

* Add sm_120 for blackwell (sgl-project#5903)

* [Feature] add support kimi vl model (sgl-project#5383)

Co-authored-by: wenju.li <wenju.li@deepctr.cn>

* support vlm benchmark profile (sgl-project#5905)

* [fix] kimi-vl test in test_vision_openai_server.py (sgl-project#5910)

* [Misc] use parallel build for cmake in sgl-kernel (sgl-project#5919)

* [qwen3] support qwen3 ep moe (sgl-project#5917)

Co-authored-by: sleepcoo <sleepcoo@gmail.com>

* Add TP2 MOE benchmarks for AMD. (sgl-project#5909)

* [Feat] Scale up fa3 kernel to sm8x arch (sgl-project#5912)

Co-authored-by: zhyncs <me@zhyncs.com>

* chore: bump sgl-kernel 0.1.1 (sgl-project#5932)

* chore: upgrade sgl-kernel 0.1.1 (sgl-project#5933)

* Remove unused method `calculate_num_image_tokens` from qwen2_vl.py (sgl-project#5783)

* [PP] Add pipeline parallelism (sgl-project#5724)

* Fix lora batch processing when input lora_path contains None (sgl-project#5930)

* add Thor & Spark (sgl-project#5915)

* fix: correct stream response when enable_thinking is set to false (sgl-project#5881)

* fix: update model runner (sgl-project#5934)

* chore: bump v0.4.6.post2 (sgl-project#5939)

* Support XiaomiMiMo/MiMo model inference (sgl-project#5921)

* [PD] Vectorise group_concurrent_contiguous in NumPy (sgl-project#5834)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* Remove extra contiguous (sgl-project#5953)

* Update ci test and doc for MTP api change (sgl-project#5952)

* docs: Fix Qwen model typo (sgl-project#5944)

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>

* Optimize a pad operation to accelerate 25us (sgl-project#5945)

* Properly return error response in vertex_generate HTTP endpoint (sgl-project#5956)

* feat: add concurrency evaluation logic in mmmu benchmark (sgl-project#5782)

* Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. (sgl-project#5960)

* feat: Refactor DeepSeekV3 function call (sgl-project#5908)

* Remove token in token out in Native API (sgl-project#5967)

* Support InternVL3 (sgl-project#5350)

Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>

* Support MMMU benchmark for  InternVL (sgl-project#5968)

* FA3 speed up: skip len operation and get batch size directly from forward batch (sgl-project#5969)

Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>

* [PD] NIXL backend Prefill TP & Decode TP+DP (sgl-project#5681)

* Fix set kv cache multi-stream (sgl-project#5975)

* Overlap qk norm with two streams (sgl-project#5977)

* fix: only upgrade nccl for cu128 (sgl-project#5986)

* Fix Phi3 serving which was broke by earlier change (sgl-project#5991)

Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>

* [perf] H100 DeepSeek-V3 fused moe tuned config (sgl-project#5998)

* [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (sgl-project#5992)

* [Minor] Fix duplicate method definitions in conversation.py (sgl-project#6012)

Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>

* Fix flaky issues of lora and add multi batch tests (sgl-project#5957)

* Tool Call: Add `chat_template_kwargs` documentation (sgl-project#5679)

* fix: fix broadcast_pyobj breaking VerlEngine (sgl-project#5997)

* [PD] Allow customizing reserved tokens to avoid KV cache waste (sgl-project#6002)

* Update dev container config to support live code sync and improve docker setup guide   (sgl-project#6018)

Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>

* [PD] Optimize disaggregation ib device help info (sgl-project#5781)

* [Test] Add flashmla attention backend test (sgl-project#5587)

* Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (sgl-project#5555)

* feat: Add a unified merge_state API (sgl-project#5428)

* feat: append more comprehensive fields in messages instead of merely role and content (sgl-project#5996)

* [Security][Bug] Prevent binding to all TCP interfaces (sgl-project#5752)

* Fix prefill OOM error in the case of large page size (sgl-project#5081)

* Fix problem of large page size with chunked prefill (sgl-project#6046)

* docs: add Google Cloud Vertex AI in Adoption and Sponsorship (sgl-project#6047)

* docs: add new blog (sgl-project#6048)

* Fix not "import os" (sgl-project#6057)

* Better PD initialization (sgl-project#5751)

* fix: deepep dockerfile, use pip install deepep. (sgl-project#5885)

* [Fix] Fix and rename flashmla CI test (sgl-project#6045)

* chore: upgrade cutlass 3.9.2 (sgl-project#6004)

Co-authored-by: yizhang2077 <1109276519@qq.com>

* Fix sgl-kernel build on aarch64 platforms (sgl-project#6062)

* Add DeepEP to CI PR Test (sgl-project#5655)

Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>

* fix custom_allreduce namespace (sgl-project#6039)

* feat: add release workflow for SGLang kernels on aarch64 (sgl-project#6010)

Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>

* [Feature] Support for Ascend NPU backend (sgl-project#3853)

Signed-off-by: Song Zhang <gepin.zs@antgroup.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>

* Fix the timeout for 8 gpu tests (sgl-project#6084)

* Hint users DeepEP normal mode is incompatible with CUDA Graph (sgl-project#5014)

* Super tiny fix doc (sgl-project#5233)

* [Doc]Fix description for dp_size argument (sgl-project#6063)

* feat(engine): add bootstrap parameters to generate methods (dynamo) (sgl-project#6075)

* [refactor] slightly tidy fp8 module (sgl-project#5993)

* Clean up fa3 test from 8 gpus (sgl-project#6105)

* Deferring 8 GPU test (sgl-project#6102)

* Update doc for MLA attention backends (sgl-project#6034)

* Clean logs for DeepSeek-V3 launching (sgl-project#6079)

* [CI]Add performance CI for VLM (sgl-project#6038)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell (sgl-project#6111)

* optimize pad operations in fa3 to accelarate 100+us (sgl-project#6077)

* Overlap shared expert and routed expert computations (sgl-project#5121)

* Tiny refactor ModelConfig.from_server_args (sgl-project#5219)

* Tiny refactor weight loading logic (sgl-project#5232)

* [PD] Add control to slow down a server (sgl-project#5572)

* Change AMD test threshold (sgl-project#6091)

* DeepEP normal support deepgemm-contiguous (sgl-project#5626)

Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
Co-authored-by: ZhengHSI <zhenghsi@qq.com>

* [fix] fix pyproject.toml dependencies (sgl-project#6119)

* [Feature] Add FlashAttention3 as a backend for VisionAttention (sgl-project#5764)

Co-authored-by: othame <chenzhu_912@zju.edu.cn>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>

* [perf] dsv3 bmm fallback to bf16 (sgl-project#5662)

* [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm (sgl-project#6097)

* [sgl-kernel] fix: fix cu118 compile error (sgl-project#6123)

Co-authored-by: zhyncs <me@zhyncs.com>

* upgrade xgrammar to 0.1.19 (sgl-project#6129)

* Remove unecessary is_fa3_supported check (sgl-project#6112)

* chore: bump sgl-kernel 0.1.2 (sgl-project#6131)

* docs: update README (sgl-project#6132)

* [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP (sgl-project#5745)

* Cutlass MLA: Disable split kv due to NVIDIA/cutlass#2274 (sgl-project#6101)

* opt flashinfer mla cat (sgl-project#5822)

Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>

* Update amd nightly concurrency. (sgl-project#6141)

* feat: add thinking_budget (sgl-project#6089)

* [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (sgl-project#6162)

* fix bug that gpu0 occupies more memory when hicache is turned on (sgl-project#5778)

Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* chore: bump v0.4.6.post3 (sgl-project#6165)

* KV‑Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (sgl-project#6016)

Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: chus-chus <chus-chus@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

* [fix] fix determine_n_share_experts_fusion (sgl-project#6118)

* Fix and Clean up chat-template requirement for VLM (sgl-project#6114)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

* [Docs]Delete duplicate content (sgl-project#6146)

Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>

* Revert "feat: add thinking_budget (sgl-project#6089)" (sgl-project#6181)

* Added async_encode method to Engine (sgl-project#4701)

* Fix data parallel perf regression (sgl-project#6183)

* Fix request abortion (sgl-project#6184)

* Add typo checker in pre-commit (sgl-project#6179)

Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* Remove duplicate IO Struct test (sgl-project#6180)

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* [PD] Add simple unit test for disaggregation feature (sgl-project#5654)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [CI] Disabled deepep tests temporarily because it takes too much time. (sgl-project#6186)

* feat: support loogle eval (sgl-project#6190)

* [fix] remove mixtral from is_fa3_default_architecture (sgl-project#6191)

* fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode (sgl-project#6169)

* chore: upgrade deepgemm (sgl-project#6073)

* chore: bump sgl-kernel v0.1.2.post1 (sgl-project#6195)

* chore: upgrade sgl-kernel v0.1.2.post1 (sgl-project#6196)

Co-authored-by: alcanderian <alcanderian@gmail.com>

* Handle empty input string for embedding models (sgl-project#5621)

Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>

* doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (sgl-project#6199)

* [Docs] minor Qwen3 and reasoning parser docs fix (sgl-project#6032)

* Improve structured outputs: fix race condition, server crash, metrics and style (sgl-project#6188)

* [CI] Reorganize the 8 gpu tests (sgl-project#6192)

* Add dev-deepep docker image (sgl-project#6198)

* Replace time.time() to time.perf_counter() for benchmarking. (sgl-project#6178)

Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>

* Update README.md (sgl-project#6202)

* Fix release-docs.yml to not use python 3.9 (sgl-project#6204)

* Fix start_profile does not support with_stack and record_shapes (sgl-project#6043)

* [doc] add a note for --n-share-experts-fusion args (sgl-project#6154)

* Performing Vocabulary Parallelism for LM Head across Attention TP Groups (sgl-project#5558)

Co-authored-by: liusy58 <liusy58@linux.alibaba.com>

* Update AMD CI docker to v0.4.6.post3-rocm630. (sgl-project#6213)

* Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (sgl-project#6201)

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* [CI] Fix PD mooncake dependency error (sgl-project#6212)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [CI] Re-enable pd disaggregation test (sgl-project#6231)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* fix some typos (sgl-project#6209)

Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>

* [Docs] Add docs for `SGLANG_` and `SGL_` environment variables (sgl-project#6206)

* [PP] Fix init_memory_pool desync & add PP for mixtral (sgl-project#6223)

* Revert "fix some typos" (sgl-project#6244)

* chore: add hf_xet dep (sgl-project#6243)

* Update AMD nightly deps. (sgl-project#6241)

* [PD] Add support for different TP sizes per DP rank (sgl-project#5922)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* Support incremental streaming of logprob/token_ids between scheduler and detokenizer (sgl-project#6225)

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* fix typo (sgl-project#6248)

* Support tuning moe for llama 4 model (sgl-project#6042)

* Skip the flaky test_stateful_custom_logit_processor (sgl-project#6251)

* [Llama4] Add docs note about enable multimodal (sgl-project#6235)

* [VERL Use Case] Add torch_memory_saver into deps (sgl-project#6247)

* Fix two issues related to `--moe-dense-tp-size=1` (sgl-project#5657)

Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com>

* model(vlm): pixtral (sgl-project#5084)

* [misc] deep_gemm fallback to NVRTC when NVCC not found (sgl-project#6252)

* Enable MI325X AMD CI. (sgl-project#6259)

* chore: bump v0.4.6.post4 (sgl-project#6245)

* formatting fix for the rebased commit for 4.6.0_post4

Signed-off-by: Mohit Sinha <msinha@habana.ai>

* fix issues in model runner and python packages

fix for following issues:
> vLLM dependency for xgrammar==0.1.17
> 'Scheduler' object has no attribute 'device
> 'pp_proxy_tensors' unexpected arg in HPUGraphRunner
> TODO: Add pipeline parallelism support in HPUGraphRunner

Signed-off-by: Mohit Sinha <msinha@habana.ai>

* fix formatting in model runner

Signed-off-by: Mohit Sinha <msinha@habana.ai>

* base grammar fix for the is_terminated case

>  'OutlinesGrammar' object has no attribute 'is_terminated'

Signed-off-by: Mohit Sinha <msinha@habana.ai>

---------

Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: congcongke <zhanweidu@163.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
Signed-off-by: Song Zhang <gepin.zs@antgroup.com>
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Wenxuan Tan <wtan45@wisc.edu>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com>
Co-authored-by: vzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: DavidBao <121073073+DavidBao03@users.noreply.github.com>
Co-authored-by: Frankey_8080 <32973306+Frank-Jie@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: yan97ao <580776+yan97ao@users.noreply.github.com>
Co-authored-by: aoshen524 <aoshen524@gmail.com>
Co-authored-by: Michał Moskal <michal@moskal.me>
Co-authored-by: lambert0312 <lambert80.ios@gmail.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: zhanweidu <zhanweidu@163.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com>
Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: JiLi <leege233@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: PGFLMG <1106310035@qq.com>
Co-authored-by: sighingnow <sighingnow@gmail.com>
Co-authored-by: XTY <xutianyi1999@live.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <qy254@cornell.edu>
Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
Co-authored-by: pengcuo <pengcbupt@163.com>
Co-authored-by: pengcuo <dgpengcuo@gmail.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: simveit <69345428+simveit@users.noreply.github.com>
Co-authored-by: Johnny <johnnync13@gmail.com>
Co-authored-by: alcanerian <alcanerian@gmail.com>
Co-authored-by: Yuhao Chen <yxckeis8@gmail.com>
Co-authored-by: zhjunqin <zhjunqin@users.noreply.github.com>
Co-authored-by: liwenju0 <like4hub@gmail.com>
Co-authored-by: wenju.li <wenju.li@deepctr.cn>
Co-authored-by: laixin <xielx@shanghaitech.edu.cn>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com>
Co-authored-by: KCFindstr <shimakaze@google.com>
Co-authored-by: xm:D <38322020+xiaomin-D@users.noreply.github.com>
Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
Co-authored-by: Yongtong Wu <914554688@qq.com>
Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: Hank Han <54751605+HanHan009527@users.noreply.github.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
Co-authored-by: Johnny <johnnynuca14@gmail.com>
Co-authored-by: Song Zhang <70674731+botieking98@users.noreply.github.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
Co-authored-by: ZhengHSI <zhenghsi@qq.com>
Co-authored-by: Zhu Chen <51010608+Othame@users.noreply.github.com>
Co-authored-by: othame <chenzhu_912@zju.edu.cn>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: xu-yfei <xu_yfei@qq.com>
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
Co-authored-by: thyecust <tienhoayu@gmail.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: Simon (Jiyou) Li <Simon-Li@users.noreply.github.com>
Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: chus-chus <chus-chus@users.noreply.github.com>
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com>
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
Co-authored-by: Steven Shimizu <shimizust@gmail.com>
Co-authored-by: applesaucethebun <113181361+applesaucethebun@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Yusong Gao <yusong.gao@gmail.com>
Co-authored-by: alcanderian <alcanderian@gmail.com>
Co-authored-by: Ravi Theja <ravi03071991@gmail.com>
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com>
Co-authored-by: Kiv Chen <34561254+KivenChen@users.noreply.github.com>
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge The PR is ready to merge after the CI is green.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants