cutlass 3.9 supported to improve fp8_blockwise_gemm #5820
Merged
Conversation
**elfiegg:** LGTM. QQ: what is the metric of the benchmark? I'm wondering if it's latency or throughput?

**Reply:** Hi @elfiegg, see `sglang/sgl-kernel/benchmark/bench_fp8_blockwise_gemm.py`, lines 115 to 146 (at 8d463fe). It is latency.

**zhyncs** approved these changes on Apr 29, 2025.
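The benchmark linked above reports per-kernel latency rather than throughput. As an illustrative sketch of the measurement pattern (this is not the actual `bench_fp8_blockwise_gemm.py` code, and the function name here is hypothetical), median latency over repeated runs can be taken like this; real GPU kernels additionally need device synchronization (e.g. CUDA events) around each timed call:

```python
import statistics
import time

def bench_latency_us(fn, warmup=5, iters=50):
    """Measure the median latency of `fn` in microseconds.

    Warmup runs are discarded so one-time costs (allocation,
    JIT compilation, autotuning) do not skew the timing, and the
    median is used so outlier iterations do not dominate.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)

# Example: time a CPU stand-in workload.
latency = bench_latency_us(lambda: sum(range(10_000)))
print(f"{latency:.1f} us")
```

With latency as the metric, lower numbers in the tables below are better at every batch size.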
Motivation
main:

Skip N=576, K=7168 for now.

deepseek-ai/DeepSeek-V3 — fp8 blockwise scaled matmul latency by batch size:

N=24576, K=7168:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   89.951999    93.024001      84.895998    76.544002
       8.0   81.216000    84.063999      70.303999    70.175998
      16.0   82.336001    85.023999      76.223999    64.640000
      32.0   80.991998    83.552003      64.000003    59.136000
      64.0   82.064003    84.991999      62.944002    57.760000
     128.0   77.087998    80.031998      97.952001    61.152000
     256.0  105.343997   107.391998     143.040001    87.296002
     512.0  199.167997   197.104007     271.295995   138.144001
    1024.0  399.904013   378.847986     537.728012   277.904004
    2048.0  800.607979   767.359972    1053.311944   556.544006
    4096.0 1619.567990  1522.143960    2200.223923  1127.392054
```

N=32768, K=512:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   16.416000    19.136000      15.232000    13.024000
       8.0   16.031999    19.136000      13.760000    12.768000
      16.0   16.384000    18.848000      13.344000    12.960000
      32.0   16.384000    18.751999      13.152000    13.088000
      64.0   16.160000    18.719999      13.376000    13.248000
     128.0   16.096000    18.751999      18.271999    13.824000
     256.0   21.312000    23.871999      28.255999    18.751999
     512.0   34.784000    36.256000      46.239998    28.767999
    1024.0   62.399998    61.439998      82.047999    46.944000
    2048.0  116.159998   110.799998     153.408006    85.919999
    4096.0  223.072007   207.519993     295.664012   165.184006
```

N=7168, K=16384:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   76.768003    78.960001     112.512000    60.224000
       8.0   76.031998    78.432001      72.480001    57.856001
      16.0   76.127999    78.624003      66.944003    57.087999
      32.0   76.608002    78.720003      70.303999    50.080001
      64.0   76.320000    79.039998      70.656002    41.887999
     128.0   76.320000    78.720003      82.847998    53.247999
     256.0   77.504002    80.416001     107.311994    58.143999
     512.0  149.471998   150.624007     200.895995   117.919996
    1024.0  289.696008   285.472006     471.136004   234.623998
    2048.0  528.591990   522.607982     755.904019   485.760003
    4096.0 1084.959984  1067.872047    1471.087933   733.471990
```

N=7168, K=18432:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   85.568003    88.096000     124.224000    65.215997
       8.0   84.384002    87.007999      83.807997    63.423999
      16.0   84.799998    86.176001      77.376001    61.567999
      32.0   84.384002    86.400002      76.800004    54.400001
      64.0   84.192000    86.592004      78.079998    45.759998
     128.0   84.063999    86.687997      94.176002    55.936001
     256.0   86.560003    89.120001     118.752003    66.143997
     512.0  166.207999   167.104006     220.912009   124.191999
    1024.0  348.863989   317.519993     514.559984   247.615993
    2048.0  601.472020   588.096023     827.552021   517.664015
    4096.0 1282.240033  1237.583995    1631.872058   824.751973
```

N=4608, K=7168:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   38.015999    41.407999      50.207999    29.023999
       8.0   37.856001    40.768001      33.824001    26.912000
      16.0   37.856001    40.768001      33.440001    25.312001
      32.0   37.951998    40.895998      33.216000    21.919999
      64.0   37.856001    40.959999      33.535998    19.808000
     128.0   38.079999    41.120000      40.927999    21.824000
     256.0   38.431998    41.536000      48.255999    26.752001
     512.0   69.600001    71.584001      81.408001    36.607999
    1024.0  101.375997   102.016002     135.263994    65.792002
    2048.0  164.287999   160.607994     215.903997   114.271998
    4096.0  299.392015   293.951988     409.103990   269.407988
```

N=3072, K=7168:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   37.567999    40.320002      41.855998    25.408000
       8.0   37.471998    40.256001      29.503999    25.312001
      16.0   37.503999    40.256001      29.279999    23.903999
      32.0   37.535999    40.383998      29.023999    19.711999
      64.0   37.471998    40.352002      29.216001    16.319999
     128.0   37.696000    40.608000      32.000002    17.824000
     256.0   37.728000    40.895998      40.128000    25.024001
     512.0   38.304001    41.216001      48.928000    28.672000
    1024.0   69.215998    71.392000      91.392003    41.664001
    2048.0  101.807997   102.784000     136.255994    76.959997
    4096.0  196.544006   198.016003     271.488011   132.927999
```

N=4096, K=512:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   10.272000    13.376000       9.312000        8.192
       8.0   10.240000    13.248000       8.480000        7.936
      16.0   10.272000    13.280000       8.288000        7.808
      32.0   10.432000    13.280000       8.288000        8.000
      64.0   10.272000    13.280000       8.160000        8.000
     128.0   10.464000    13.296000       8.608000        8.032
     256.0   10.384000    13.312000       9.504000        8.608
     512.0   10.624000    13.504000      10.624000        9.792
    1024.0   13.760000    16.287999      16.368000       12.128
    2048.0   20.927999    23.040000      26.016001       17.600
    4096.0   34.784000    36.384001      44.480000       27.424
```

N=3072, K=1536:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   14.528000    17.247999      14.336000    10.336000
       8.0   14.656000    17.247999      12.032000    10.016000
      16.0   14.560000    17.247999      11.584000     9.872000
      32.0   14.528000    17.279999      11.744000     9.376000
      64.0   14.592000    17.312000      11.680000     9.376000
     128.0   14.656000    17.376000      12.320000     9.664000
     256.0   14.816000    17.472001      13.792000    10.464000
     512.0   14.976000    17.535999      16.480001    11.840000
    1024.0   22.496000    24.831999      27.680000    16.287999
    2048.0   30.463999    32.127999      38.015999    23.808001
    4096.0   53.024001    54.912001      68.672001    37.632000
```

N=512, K=7168:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   20.608000    31.872001      31.615999    13.632000
       8.0   20.608000    31.904001      26.912000    13.248000
      16.0   20.703999    31.904001      26.944000    12.896000
      32.0   20.768000    31.840000      26.784001    12.864000
      64.0   20.752000    31.679999      26.912000    12.928000
     128.0   20.736000    32.000002      26.784001    12.896000
     256.0   21.056000    32.320000      26.784001    13.120000
     512.0   21.792000    32.703999      27.295999    15.456000
    1024.0   26.720000    31.840000      30.719999    16.448000
    2048.0   33.535998    36.031999      39.551999    21.888001
    4096.0   46.656001    44.000000      50.719999    29.983999
```

N=7168, K=2304: (results truncated in the source)
18.912001 21.919999 22.399999 14.528000 1 8.0 18.464001 21.376001 17.824000 14.176000 2 16.0 18.592000 21.504000 17.344000 14.048000 3 32.0 18.528000 21.407999 17.152000 13.696000 4 64.0 18.464001 21.504000 17.088000 13.312000 5 128.0 18.560000 21.632001 18.848000 14.720000 6 256.0 18.688001 21.792000 22.431999 16.416000 7 512.0 29.279999 31.583998 36.800001 24.224000 8 1024.0 50.624002 50.976001 77.312000 39.584000 9 2048.0 82.911998 80.991998 111.040004 71.199998 10 4096.0 158.976004 152.224004 218.096003 108.319998 deepseek-ai/DeepSeek-V3 N=7168 K=2048: fp8 blockwise scaled matmul: batch_size vllm sgl-kernel sglang triton deepgemm 0 1.0 17.696001 20.703999 21.760000 12.704000 1 8.0 17.216001 20.191999 14.816000 12.768000 2 16.0 17.600000 20.479999 14.624000 12.320000 3 32.0 17.408000 20.191999 14.592000 11.712000 4 64.0 17.376000 20.288000 14.688000 11.744000 5 128.0 17.535999 20.400001 16.543999 12.800000 6 256.0 17.535999 20.544000 20.096000 14.624000 7 512.0 26.912000 29.440001 33.087999 22.848001 8 1024.0 45.919999 46.688002 70.015997 38.240001 9 2048.0 75.456001 73.536001 101.888001 68.063997 10 4096.0 144.191995 137.503996 193.376005 102.080002 deepseek-ai/DeepSeek-V3 N=7168 K=256: fp8 blockwise scaled matmul: batch_size vllm sgl-kernel sglang triton deepgemm 0 1.0 9.440000 12.512000 8.736000 8.448000 1 8.0 9.408000 12.448000 7.616000 8.000000 2 16.0 9.408000 12.416000 7.584000 8.192000 3 32.0 9.504000 12.320000 7.584000 8.160000 4 64.0 9.504000 12.352000 7.904000 8.192000 5 128.0 9.408000 12.224000 8.480000 8.480000 6 256.0 9.760000 12.544000 9.568000 9.056000 7 512.0 11.904000 14.496000 12.992000 11.424000 8 1024.0 16.352000 18.848000 23.232000 15.104000 9 2048.0 23.808001 25.599999 34.272000 22.752000 10 4096.0 40.832002 41.824002 61.471999 31.615999 Benchmark finished!
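For context on what "fp8 blockwise scaled matmul" is timing: the weight is quantized per 128x128 tile with one scale per tile, and the GEMM has to apply those scales during accumulation. Below is a minimal NumPy sketch of that scaling scheme, assuming a weight-only, tile-exact view; the function names are mine, and the actual cast to float8 (with its rounding error) is only simulated by rescaling.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3


def quantize_blockwise(w, block=128):
    """Compute one scale per (block x block) tile of a 2-D weight and
    rescale each tile so its max magnitude maps into the fp8 range.
    The cast to float8 itself (and its rounding) is omitted in this sketch."""
    n, k = w.shape
    scales = np.empty((n // block, k // block))
    q = np.empty_like(w)
    for i in range(0, n, block):
        for j in range(0, k, block):
            tile = w[i:i + block, j:j + block]
            s = np.abs(tile).max() / FP8_E4M3_MAX
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = tile / s
    return q, scales


def blockwise_scaled_matmul(a, b_q, b_scales, block=128):
    """Reference GEMM: dequantize each weight tile with its scale, then
    matmul. The CUTLASS kernels fuse this rescale into the fp8 GEMM
    instead of materializing the dequantized weight."""
    n, k = b_q.shape
    b_deq = np.empty_like(b_q)
    for i in range(0, n, block):
        for j in range(0, k, block):
            b_deq[i:i + block, j:j + block] = (
                b_q[i:i + block, j:j + block] * b_scales[i // block, j // block]
            )
    return a @ b_deq.T
```

Activations get an analogous per-group (1x128) scale in the real kernels; this sketch only shows the weight side, which is where the 128x128 tiling in the benchmarked shapes comes from.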
pr (this branch, built against CUTLASS 3.9):
```
DeepSeek-V3 tp=8
INFO 04-28 06:38:20 [__init__.py:239] Automatically detected platform cuda.
Skip N=576, K=7168 now

deepseek-ai/DeepSeek-V3 N=24576 K=7168: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      89.887999    92.224002    85.184000    76.416001
1   8.0      81.216000    83.456002    70.239998    70.592001
2   16.0     82.336001    84.608003    76.352000    65.056004
3   32.0     81.055999    83.071999    64.127997    59.583999
4   64.0     82.479998    84.352002    63.327998    57.952002
5   128.0    77.632003    79.935998    98.240003    61.360002
6   256.0    105.503999   106.112003   143.616006   86.687997
7   512.0    201.072007   197.392002   272.015989   136.656001
8   1024.0   395.135999   386.496007   536.880016   272.992015
9   2048.0   793.167949   762.272000   1054.816008  566.496015
10  4096.0   1594.303966  1581.055999  2187.151909  1143.615961

deepseek-ai/DeepSeek-V3 N=32768 K=512: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      16.480001    19.328000    15.840000    13.024000
1   8.0      16.128000    19.424001    13.696000    12.928000
2   16.0     15.968001    19.136000    13.536000    13.184000
3   32.0     16.224001    19.168001    13.280000    13.056000
4   64.0     16.000001    19.040000    13.728000    13.200000
5   128.0    16.128000    19.231999    18.400000    14.176000
6   256.0    21.536000    23.968000    28.287999    19.168001
7   512.0    34.880001    35.999998    46.271998    29.023999
8   1024.0   62.368002    60.256001    82.751997    47.488000
9   2048.0   117.215998   107.808001   155.328006   85.727997
10  4096.0   223.616004   199.647993   295.520008   165.408000

deepseek-ai/DeepSeek-V3 N=7168 K=16384: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      77.183999    76.991998    112.896003   60.384002
1   8.0      75.935997    76.640002    72.800003    58.240000
2   16.0     76.448001    76.672003    67.167997    57.023998
3   32.0     76.768003    76.831996    70.656002    50.239999
4   64.0     76.448001    77.151999    71.071997    41.855998
5   128.0    75.712003    76.448001    82.815997    53.376000
6   256.0    77.791996    78.272000    108.255997   58.432002
7   512.0    149.215996   147.264004   203.040004   118.207999
8   1024.0   289.664000   285.135984   474.687994   230.463997
9   2048.0   536.095977   511.615992   710.655987   466.239989
10  4096.0   1084.720016  1057.407975  1456.592083  710.687995

deepseek-ai/DeepSeek-V3 N=7168 K=18432: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      85.759997    85.663997    124.767996   65.600000
1   8.0      84.608003    85.023999    83.743997    63.840002
2   16.0     84.895998    84.927998    77.568002    61.535999
3   32.0     85.440002    85.120000    77.087998    54.623999
4   64.0     84.863998    84.991999    78.272000    46.176001
5   128.0    84.959999    85.199997    95.168002    55.872001
6   256.0    86.528003    87.583996    118.592001   66.367999
7   512.0    167.104006   164.031997   221.103996   124.512002
8   1024.0   329.216003   317.535996   513.599992   255.488008
9   2048.0   593.631983   577.744007   799.504042   493.696004
10  4096.0   1387.359977  1252.544045  1652.799964  782.815993

deepseek-ai/DeepSeek-V3 N=4608 K=7168: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      38.304001    40.031999    50.687999    28.928000
1   8.0      38.112000    39.744001    33.856001    26.815999
2   16.0     37.919998    39.744001    33.440001    25.728000
3   32.0     37.888002    39.840002    33.599999    22.112001
4   64.0     38.015999    39.840002    33.696000    20.000000
5   128.0    38.368002    40.031999    41.343998    22.016000
6   256.0    38.431998    40.640000    48.160002    26.880000
7   512.0    69.664001    69.343999    81.696004    36.543999
8   1024.0   101.760000   99.840000    135.008007   65.632001
9   2048.0   164.287999   161.200002   216.064006   114.784002
10  4096.0   298.752010   294.528008   406.816006   273.056000

deepseek-ai/DeepSeek-V3 N=3072 K=7168: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      37.664000    39.584000    42.048000    25.615999
1   8.0      37.439998    39.296001    29.696001    25.248000
2   16.0     37.728000    39.296001    29.503999    23.776000
3   32.0     37.503999    39.328001    29.408000    20.128001
4   64.0     37.567999    39.360002    29.600000    16.448000
5   128.0    37.983999    39.551999    32.032002    18.208001
6   256.0    38.112000    40.064000    40.320002    25.024001
7   512.0    38.527999    40.544000    48.928000    28.640000
8   1024.0   69.728002    70.271999    92.416003    41.664001
9   2048.0   102.112003   102.527998   136.575997   76.704003
10  4096.0   195.968002   193.599999   269.760013   133.184001

deepseek-ai/DeepSeek-V3 N=4096 K=512: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      10.432000    13.728000    9.696000     8.160000
1   8.0      10.560000    13.696000    8.704000     8.096000
2   16.0     10.432000    13.728000    8.544000     7.936000
3   32.0     10.656000    13.760000    8.480000     7.936000
4   64.0     10.720000    13.792000    8.352000     8.192000
5   128.0    10.464000    13.824000    8.800000     8.416000
6   256.0    10.528000    13.696000    9.696000     8.960000
7   512.0    10.912000    13.856000    11.104000    9.952000
8   1024.0   13.984000    16.767999    15.936000    12.256000
9   2048.0   21.215999    23.232000    25.888000    17.856000
10  4096.0   35.168000    35.872001    44.767998    27.712001

deepseek-ai/DeepSeek-V3 N=3072 K=1536: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      14.752000    17.472001    14.368000    10.544000
1   8.0      14.912000    17.472001    12.000000    10.432000
2   16.0     14.912000    17.472001    11.776000    10.112000
3   32.0     14.944000    17.503999    11.936000    9.536000
4   64.0     14.816000    17.519999    11.840000    9.600000
5   128.0    14.880000    17.600000    12.704000    9.664000
6   256.0    14.976000    17.728001    14.112000    10.400000
7   512.0    15.168000    17.792000    16.608000    11.968000
8   1024.0   22.720000    24.896000    27.840000    16.319999
9   2048.0   30.624000    32.352000    38.240001    24.224000
10  4096.0   53.952001    53.856000    69.343999    37.664000

deepseek-ai/DeepSeek-V3 N=512 K=7168: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      20.927999    31.599998    31.840000    13.824000
1   8.0      20.927999    31.615999    27.104000    13.472000
2   16.0     20.959999    31.583998    27.168000    13.248000
3   32.0     20.768000    31.552002    27.200000    13.184000
4   64.0     20.864001    31.328000    27.071999    13.152000
5   128.0    20.768000    31.711999    26.815999    13.056000
6   256.0    21.088000    32.000002    27.008001    13.472000
7   512.0    22.112001    32.288000    27.327999    16.096000
8   1024.0   26.848000    30.975999    30.880000    16.256001
9   2048.0   33.376001    34.432001    39.519999    21.663999
10  4096.0   46.751998    41.983999    50.912000    30.304000

deepseek-ai/DeepSeek-V3 N=7168 K=2304: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      19.200001    21.919999    22.399999    14.720000
1   8.0      18.448001    21.312000    17.824000    14.336000
2   16.0     18.624000    21.472000    17.535999    14.464000
3   32.0     18.560000    21.344000    17.152000    13.728000
4   64.0     18.719999    21.472000    17.472001    13.504000
5   128.0    18.688001    21.568000    19.040000    14.912000
6   256.0    19.072000    21.856001    22.431999    15.936000
7   512.0    29.536000    31.520002    37.280001    24.480000
8   1024.0   51.040001    51.584002    78.111999    39.935999
9   2048.0   83.296001    82.336001    111.040004   71.616001
10  4096.0   159.199998   155.103996   216.495991   109.151997

deepseek-ai/DeepSeek-V3 N=7168 K=2048: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      18.144000    20.768000    21.663999    13.056000
1   8.0      17.632000    20.288000    15.264000    12.832000
2   16.0     17.888000    20.512000    15.072000    12.704000
3   32.0     17.664000    20.384001    14.848000    11.872000
4   64.0     17.568000    20.416001    15.008000    11.936000
5   128.0    17.759999    20.479999    16.896000    12.960000
6   256.0    17.888000    20.640001    20.320000    14.848000
7   512.0    27.264001    29.376000    33.312000    23.391999
8   1024.0   46.335999    47.488000    70.688002    38.400002
9   2048.0   75.552002    74.816003    102.176003   68.351999
10  4096.0   144.127995   141.072005   193.248004   102.240004

deepseek-ai/DeepSeek-V3 N=7168 K=256: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      9.568000     12.960000    8.928000     8.608000
1   8.0      9.792000     12.928000    7.776000     8.416000
2   16.0     9.536000     12.928000    8.000000     8.416000
3   32.0     9.504000     12.800000    8.000000     8.160000
4   64.0     9.504000     12.800000    8.064000     8.160000
5   128.0    9.632000     12.768000    8.704000     8.672000
6   256.0    9.568000     12.992000    9.888000     9.280000
7   512.0    11.936000    14.912000    13.184000    11.520000
8   1024.0   16.480001    19.136000    23.552001    15.456000
9   2048.0   24.032000    25.728000    34.655999    22.720000
10  4096.0   40.959999    40.895998    61.503999    31.552002

Benchmark finished!
```
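A convenient way to read the two dumps is to compare a single column between runs. A small sketch, using the batch-4096 `sgl-kernel` latencies copied from the tables above (the dict keys are just labels I chose; units are whatever the benchmark script reports, latency per the discussion in this thread):

```python
# batch_size=4096 sgl-kernel latencies from the two benchmark dumps above.
main_run = {
    "N=32768 K=512": 207.519993,
    "N=7168 K=16384": 1067.872047,
    "N=7168 K=18432": 1237.583995,
}
pr_run = {
    "N=32768 K=512": 199.647993,
    "N=7168 K=16384": 1057.407975,
    "N=7168 K=18432": 1252.544045,
}

# Ratio > 1 means the PR run was faster for that shape.
speedup = {shape: main_run[shape] / pr_run[shape] for shape in main_run}
for shape, s in speedup.items():
    print(f"{shape}: {s:.3f}x")
```

As the ratios show, the effect of the CUTLASS 3.9 upgrade varies with shape and batch size; some shapes improve while others are roughly flat or slightly slower in this particular run.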
## Modifications

## Checklist