reduce moe_align_block_size_kernel small batch mode overhead #5086
Conversation
cool
@BBuf Can you successfully use the tuning script after this change?
@zhyncs @merrymercy Now I have fixed all the bugs and performance issues in sgl_kernel.
num_tokens num_experts topk SGL Triton VLLM
0 16384.0 8.0 1.0 27.744001 377.983987 126.864001
1 16384.0 8.0 2.0 36.991999 733.215988 259.519994
2 16384.0 8.0 4.0 55.583999 1440.768003 508.512020
3 16384.0 8.0 8.0 92.384003 2859.999895 1001.136065
4 16384.0 32.0 1.0 28.767999 119.584002 146.592006
5 16384.0 32.0 2.0 38.943999 212.543994 295.103997
6 16384.0 32.0 4.0 58.304001 385.760009 608.287990
7 16384.0 32.0 8.0 98.463997 745.248020 1175.904036
8 16384.0 64.0 1.0 30.944001 76.448001 109.856002
9 16384.0 64.0 2.0 41.760001 123.744003 200.320005
10 16384.0 64.0 4.0 63.584000 216.831997 382.048011
11 16384.0 64.0 8.0 107.199997 390.175998 755.392015
12 16384.0 128.0 1.0 30.912001 59.136000 79.296000
13 16384.0 128.0 2.0 40.383998 83.360001 129.728004
14 16384.0 128.0 4.0 58.559999 129.951999 228.271991
15 16384.0 128.0 8.0 96.032001 223.072007 426.912010
16 16384.0 256.0 1.0 35.392001 67.327999 98.463997
17 16384.0 256.0 2.0 41.439999 80.192000 131.871998
18 16384.0 256.0 4.0 57.760000 102.816001 448.000014
19 16384.0 256.0 8.0 90.559997 152.288005 762.592018
20 32768.0 8.0 1.0 37.216000 738.048017 259.552002
21 32768.0 8.0 2.0 55.536002 1451.167941 507.488012
22 32768.0 8.0 4.0 92.896000 2860.960007 1002.847910
23 32768.0 8.0 8.0 169.760004 5724.895954 1990.640044
24 32768.0 32.0 1.0 39.039999 212.543994 293.888003
25 32768.0 32.0 2.0 58.688000 386.511981 610.943973
26 32768.0 32.0 4.0 98.623998 745.696008 1178.751945
27 32768.0 32.0 8.0 180.319995 1468.608022 2393.248081
28 32768.0 64.0 1.0 42.048000 123.039998 200.703993
29 32768.0 64.0 2.0 63.231997 216.352001 382.432014
30 32768.0 64.0 4.0 107.648000 390.464008 751.215994
31 32768.0 64.0 8.0 199.072003 752.991974 1491.999984
32 32768.0 128.0 1.0 40.320002 83.376005 129.567996
33 32768.0 128.0 2.0 58.816001 130.991995 228.863999
34 32768.0 128.0 4.0 96.064001 223.744005 427.583992
35 32768.0 128.0 8.0 172.992006 399.744004 844.015956
36 32768.0 256.0 1.0 41.471999 78.783996 131.935999
37 32768.0 256.0 2.0 57.696000 103.200004 449.088007
38 32768.0 256.0 4.0 90.655997 152.383998 763.616025
39 32768.0 256.0 8.0 159.904003 249.952003 1397.055984
40 65536.0 8.0 1.0 55.520002 1451.712012 507.856011
41 65536.0 8.0 2.0 92.799999 2862.272024 1002.287984
42 65536.0 8.0 4.0 168.799996 5728.703976 1989.536047
43 65536.0 8.0 8.0 315.903991 11395.999908 3962.704182
44 65536.0 32.0 1.0 58.816001 386.335999 609.535992
45 65536.0 32.0 2.0 98.495997 746.240020 1169.664025
46 65536.0 32.0 4.0 181.088001 1468.719959 2406.896114
47 65536.0 32.0 8.0 336.800009 2901.887894 4866.208076
48 65536.0 64.0 1.0 63.231997 216.639996 382.016003
49 65536.0 64.0 2.0 107.327998 390.560001 752.560019
50 65536.0 64.0 4.0 198.080003 752.416015 1490.944028
51 65536.0 64.0 8.0 372.415990 1472.352028 2908.512115
52 65536.0 128.0 1.0 58.975998 130.559996 228.720009
53 65536.0 128.0 2.0 96.064001 223.583996 428.351998
54 65536.0 128.0 4.0 174.464002 399.792016 842.736006
55 65536.0 128.0 8.0 322.880000 761.391997 1638.512015
56 65536.0 256.0 1.0 57.728000 103.391998 447.551996
57 65536.0 256.0 2.0 90.767995 152.160004 763.599992
58 65536.0 256.0 4.0 158.672005 249.952003 1397.632003
59 65536.0 256.0 8.0 289.600015 429.791987 2670.127869
60 131072.0 8.0 1.0 92.735998 2864.223957 1001.824021
61 131072.0 8.0 2.0 168.768004 5724.736214 1993.952036
62 131072.0 8.0 4.0 315.744013 11396.672249 3963.071823
63 131072.0 8.0 8.0 606.112003 22729.358673 7922.160149
64 131072.0 32.0 1.0 98.527998 746.016026 1176.031947
65 131072.0 32.0 2.0 181.024000 1468.736053 2392.623901
66 131072.0 32.0 4.0 336.479992 2903.136015 4890.624046
67 131072.0 32.0 8.0 647.647977 5768.127918 9744.992256
68 131072.0 64.0 1.0 107.423998 390.464008 752.864003
69 131072.0 64.0 2.0 198.783994 752.255976 1490.175962
70 131072.0 64.0 4.0 372.447997 1472.336054 2910.624027
71 131072.0 64.0 8.0 717.664003 2899.647951 5952.688217
72 131072.0 128.0 1.0 96.256003 224.127993 427.648008
73 131072.0 128.0 2.0 173.439994 399.776012 843.312025
74 131072.0 128.0 4.0 322.784007 761.744022 1638.623953
75 131072.0 128.0 8.0 617.088020 1481.727958 3243.520021
76 131072.0 256.0 1.0 90.751998 152.416006 763.360023
77 131072.0 256.0 2.0 159.040004 249.727994 1398.591995
78 131072.0 256.0 4.0 289.503992 429.536015 2671.008110
79 131072.0 256.0 8.0 548.527956 797.551990 5192.447662

Now the kernel outperforms Triton and VLLM for all combinations of num_tokens, num_experts, and topk. Previously, it was slower than Triton when the token count was large (e.g. >= 65536) because a counting loop had not been converted to a strided loop, which resulted in non-contiguous memory access. cc @yiakwy-xpu-ml-framework-team
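For reference, a minimal sketch of the strided counting pattern described above, with illustrative names (`topk_ids`, `expert_counts`) rather than the exact sgl-kernel code:

```cuda
#include <cstdint>

// Grid-stride counting loop: on every iteration, consecutive threads read
// consecutive elements of topk_ids, so global loads coalesce even when the
// flattened size (num_tokens * topk) is much larger than the grid.
__global__ void count_tokens_per_expert(const int32_t* __restrict__ topk_ids,
                                        int32_t* __restrict__ expert_counts,
                                        size_t numel) {
  const size_t start  = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  const size_t stride = static_cast<size_t>(blockDim.x) * gridDim.x;
  for (size_t i = start; i < numel; i += stride) {
    atomicAdd(&expert_counts[topk_ids[i]], 1);
  }
}
```

By contrast, giving each thread its own contiguous chunk (thread t handling [t*chunk, (t+1)*chunk)) makes neighboring threads read addresses that are a whole chunk apart on each step, so the loads cannot coalesce and throughput collapses at large token counts.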
Motivation
Fix the moe_align_block_size kernel benchmark bug: the cumsum_buffer allocation is changed from torch.zeros to torch.empty in fused_moe_triton.py. For the small-token mode, I wrote a new simple kernel to handle it; it does not need torch.zeros to initialize the cumsum buffer, since everything happens in shared memory (a sketch follows below).
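A minimal sketch of that small-batch idea, assuming a single-block launch and illustrative names (this is not the exact kernel added by this PR): the per-expert counts and the padded cumulative sum are zero-initialized and built entirely in shared memory, which is why the global cumsum buffer no longer has to come from torch.zeros.

```cuda
#include <cstdint>

// Hedged sketch of a single-block "small batch" moe_align_block_size-style
// kernel. The real kernel's outputs and names may differ; the point is that
// the counters and the cumulative sum never rely on the global buffer being
// pre-zeroed, so it can be allocated with torch.empty.
template <int MAX_EXPERTS>
__global__ void moe_align_small_batch_sketch(
    const int32_t* __restrict__ topk_ids,   // flattened [num_tokens * topk]
    int32_t numel,                          // num_tokens * topk
    int32_t num_experts,                    // <= MAX_EXPERTS
    int32_t block_size,                     // MoE block size used for padding
    int32_t* __restrict__ cumsum_out,       // [num_experts + 1], torch.empty is fine
    int32_t* __restrict__ total_padded) {   // scalar output
  __shared__ int32_t smem_cumsum[MAX_EXPERTS + 1];

  // Zero-init in shared memory replaces the host-side torch.zeros.
  for (int e = threadIdx.x; e <= num_experts; e += blockDim.x) {
    smem_cumsum[e] = 0;
  }
  __syncthreads();

  // Count tokens per expert; counts are staged at index (expert + 1).
  for (int i = threadIdx.x; i < numel; i += blockDim.x) {
    atomicAdd(&smem_cumsum[topk_ids[i] + 1], 1);
  }
  __syncthreads();

  // Sequential scan with per-expert padding; cheap for the small token and
  // expert counts this mode targets.
  if (threadIdx.x == 0) {
    for (int e = 0; e < num_experts; ++e) {
      const int32_t padded =
          (smem_cumsum[e + 1] + block_size - 1) / block_size * block_size;
      smem_cumsum[e + 1] = smem_cumsum[e] + padded;
    }
    *total_padded = smem_cumsum[num_experts];
  }
  __syncthreads();

  // Write results out; the destination never needed zero-initialization.
  for (int e = threadIdx.x; e <= num_experts; e += blockDim.x) {
    cumsum_out[e] = smem_cumsum[e];
  }
}
```

A launch in this mode would use a single block, e.g. `moe_align_small_batch_sketch<256><<<1, 256>>>(...)`, with the output tensors allocated via torch.empty on the Python side.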
Acc test

I set token_cnts_buffer and cumsum_buffer to torch.empty in fused_moe.py:

Acc result:
Kernel unit-test
Benchmark on H200
main branch:
📊 Running performance benchmark for 8 experts... moe-align-block-size-performance: num_tokens num_experts topk SGL Triton VLLM 0 1.0 8.0 1.0 18.975999 67.359999 14.624000 1 1.0 8.0 2.0 19.136000 23.264000 14.607999 2 1.0 8.0 4.0 20.384001 63.519999 14.624000 3 1.0 8.0 8.0 19.424001 62.368002 14.656000 4 1.0 32.0 1.0 20.128001 62.912002 16.640000 5 1.0 32.0 2.0 21.536000 63.263997 16.640000 6 1.0 32.0 4.0 21.632001 69.023997 16.576000 7 1.0 32.0 8.0 20.256000 59.712000 16.640000 8 1.0 64.0 1.0 22.816001 56.912001 20.128001 9 1.0 64.0 2.0 22.816001 28.672000 20.256000 10 1.0 64.0 4.0 22.784000 69.440000 20.223999 11 1.0 64.0 8.0 22.816001 65.024003 20.288000 12 1.0 128.0 1.0 24.224000 64.511999 30.975999 13 1.0 128.0 2.0 23.040000 68.335995 30.944001 14 1.0 128.0 4.0 24.288001 63.167997 30.944001 15 1.0 128.0 8.0 23.135999 63.808002 31.040000 16 1.0 256.0 1.0 26.559999 58.240000 65.471999 17 1.0 256.0 2.0 26.591999 70.528001 65.471999 18 1.0 256.0 4.0 26.815999 61.471999 65.632001 19 1.0 256.0 8.0 27.872000 59.680000 65.664001 20 8.0 8.0 1.0 19.200001 65.568000 14.688000 21 8.0 8.0 2.0 19.455999 60.447998 14.688000 22 8.0 8.0 4.0 20.320000 69.552004 14.544000 23 8.0 8.0 8.0 20.447999 68.736002 14.720000 24 8.0 32.0 1.0 21.600001 60.208000 16.640000 25 8.0 32.0 2.0 21.663999 59.999999 16.608000 26 8.0 32.0 4.0 20.576000 62.399998 16.576000 27 8.0 32.0 8.0 21.632001 60.095999 16.672000 28 8.0 64.0 1.0 21.376001 27.264001 20.223999 29 8.0 64.0 2.0 21.376001 60.112000 20.320000 30 8.0 64.0 4.0 21.344000 62.816001 20.223999 31 8.0 64.0 8.0 22.911999 27.360000 20.320000 32 8.0 128.0 1.0 24.288001 57.535999 30.975999 33 8.0 128.0 2.0 24.256000 57.856001 31.104000 34 8.0 128.0 4.0 23.135999 64.544000 31.104000 35 8.0 128.0 8.0 24.224000 57.392001 31.136001 36 8.0 256.0 1.0 27.872000 69.199994 65.632001 37 8.0 256.0 2.0 26.591999 63.040003 65.632001 38 8.0 256.0 4.0 27.872000 55.744000 65.664001 39 8.0 256.0 8.0 27.872000 69.888003 65.696001 40 16.0 8.0 1.0 20.352000 58.944002 14.592000 41 16.0 8.0 2.0 20.352000 53.440001 14.560000 42 16.0 8.0 4.0 19.200001 55.904001 14.720000 43 16.0 8.0 8.0 20.416001 71.744002 14.944000 44 16.0 32.0 1.0 21.600001 65.295994 16.640000 45 16.0 32.0 2.0 20.064000 66.111997 16.576000 46 16.0 32.0 4.0 20.160001 58.543999 16.704001 47 16.0 32.0 8.0 21.600001 156.992003 16.960001 48 16.0 64.0 1.0 21.376001 63.135996 20.288000 49 16.0 64.0 2.0 22.752000 61.792001 20.256000 50 16.0 64.0 4.0 21.344000 63.792005 20.320000 51 16.0 64.0 8.0 21.824000 57.535999 20.416001 52 16.0 128.0 1.0 23.040000 70.175998 31.120000 53 16.0 128.0 2.0 24.224000 63.936003 31.040000 54 16.0 128.0 4.0 23.135999 68.640001 31.168001 55 16.0 128.0 8.0 23.232000 57.119999 31.231999 56 16.0 256.0 1.0 27.904000 66.239998 65.696001 57 16.0 256.0 2.0 26.976001 68.191998 65.664001 58 16.0 256.0 4.0 26.815999 57.280000 65.664001 59 16.0 256.0 8.0 27.968001 62.688001 65.888003 60 32.0 8.0 1.0 20.320000 68.127997 14.560000 61 32.0 8.0 2.0 20.320000 63.167997 14.720000 62 32.0 8.0 4.0 20.352000 56.063998 14.976000 63 32.0 8.0 8.0 19.040000 68.223998 15.552000 64 32.0 32.0 1.0 20.064000 61.280001 16.608000 65 32.0 32.0 2.0 21.663999 59.776001 16.704001 66 32.0 32.0 4.0 21.504000 55.968001 16.992001 67 32.0 32.0 8.0 20.160001 64.000003 17.344000 68 32.0 64.0 1.0 21.824000 61.951999 20.223999 69 32.0 64.0 2.0 21.856001 60.896002 20.320000 70 32.0 64.0 4.0 21.856001 54.048002 20.416001 71 32.0 64.0 8.0 21.632001 72.704002 20.864001 72 32.0 128.0 1.0 23.072001 65.888003 31.136001 73 32.0 128.0 2.0 22.944000 
60.063999 31.104000 74 32.0 128.0 4.0 23.072001 70.335999 31.199999 75 32.0 128.0 8.0 23.391999 60.288001 31.328000 76 32.0 256.0 1.0 27.872000 67.456000 65.728001 77 32.0 256.0 2.0 27.775999 67.071997 65.664001 78 32.0 256.0 4.0 27.872000 61.567999 65.792002 79 32.0 256.0 8.0 26.815999 69.023997 65.920003 80 64.0 8.0 1.0 20.479999 58.272000 14.720000 81 64.0 8.0 2.0 20.576000 65.087996 14.944000 82 64.0 8.0 4.0 20.064000 65.375999 15.552000 83 64.0 8.0 8.0 19.424001 72.959997 16.960001 84 64.0 32.0 1.0 21.600001 68.624005 16.736001 85 64.0 32.0 2.0 21.695999 67.376003 16.928000 86 64.0 32.0 4.0 21.504000 64.095996 17.344000 87 64.0 32.0 8.0 21.856001 64.159997 19.200001 88 64.0 64.0 1.0 21.600001 70.367999 20.320000 89 64.0 64.0 2.0 22.848001 68.847999 20.447999 90 64.0 64.0 4.0 21.952000 62.912002 20.800000 91 64.0 64.0 8.0 22.368001 65.728001 21.056000 92 64.0 128.0 1.0 23.167999 75.471997 31.136001 93 64.0 128.0 2.0 23.232000 32.575998 31.296000 94 64.0 128.0 4.0 23.391999 67.727998 31.296000 95 64.0 128.0 8.0 24.383999 60.672000 31.968001 96 64.0 256.0 1.0 26.591999 74.047998 65.728001 97 64.0 256.0 2.0 27.872000 66.463999 65.888003 98 64.0 256.0 4.0 28.063999 64.095996 65.888003 99 64.0 256.0 8.0 26.591999 69.824003 66.111997 100 128.0 8.0 1.0 19.168001 68.896003 14.944000 101 128.0 8.0 2.0 20.096000 60.768001 15.568000 102 128.0 8.0 4.0 20.608000 67.039996 16.928000 103 128.0 8.0 8.0 20.191999 69.632001 20.864001 104 128.0 32.0 1.0 21.536000 67.135997 16.960001 105 128.0 32.0 2.0 21.504000 68.000004 17.344000 106 128.0 32.0 4.0 22.016000 64.223997 19.231999 107 128.0 32.0 8.0 21.888001 68.672001 23.296000 108 128.0 64.0 1.0 22.879999 73.504001 20.416001 109 128.0 64.0 2.0 21.663999 59.712000 20.832000 110 128.0 64.0 4.0 21.728000 66.624001 21.056000 111 128.0 64.0 8.0 21.280000 68.832003 22.560000 112 128.0 128.0 1.0 23.296000 60.095999 31.231999 113 128.0 128.0 2.0 22.911999 70.015997 31.296000 114 128.0 128.0 4.0 24.416000 70.464000 32.032002 115 128.0 128.0 8.0 23.328001 71.327999 32.960001 116 128.0 256.0 1.0 27.039999 216.959998 65.888003 117 128.0 256.0 2.0 28.031999 65.743998 65.856002 118 128.0 256.0 4.0 27.872000 69.632001 66.207998 119 128.0 256.0 8.0 27.008001 68.688005 66.463999 120 256.0 8.0 1.0 20.128001 58.143999 15.536000 121 256.0 8.0 2.0 20.671999 63.231997 16.960001 122 256.0 8.0 4.0 20.223999 65.600000 20.896001 123 256.0 8.0 8.0 21.695999 71.744002 28.287999 124 256.0 32.0 1.0 21.632001 59.103999 17.344000 125 256.0 32.0 2.0 22.016000 29.056000 19.231999 126 256.0 32.0 4.0 20.927999 68.159997 23.264000 127 256.0 32.0 8.0 22.336001 67.520000 30.432001 128 256.0 64.0 1.0 21.632001 29.888000 20.864001 129 256.0 64.0 2.0 22.592001 58.591999 21.088000 130 256.0 64.0 4.0 21.728000 58.623999 22.399999 131 256.0 64.0 8.0 23.744000 66.367999 27.680000 132 256.0 128.0 1.0 23.391999 65.600000 31.296000 133 256.0 128.0 2.0 23.135999 63.023999 32.000002 134 256.0 128.0 4.0 24.704000 68.031996 32.960001 135 256.0 128.0 8.0 23.744000 61.919998 36.031999 136 256.0 256.0 1.0 28.224001 68.768002 65.920003 137 256.0 256.0 2.0 26.591999 65.952003 66.192001 138 256.0 256.0 4.0 27.807999 52.064002 66.479996 139 256.0 256.0 8.0 28.896000 58.784001 68.191998 140 512.0 8.0 1.0 19.231999 69.408000 16.960001 141 512.0 8.0 2.0 19.455999 58.432002 20.800000 142 512.0 8.0 4.0 20.512000 69.215998 28.320000 143 512.0 8.0 8.0 21.056000 114.047997 42.080000 144 512.0 32.0 1.0 21.919999 59.119999 19.200001 145 512.0 32.0 2.0 20.992000 59.071999 23.264000 146 512.0 32.0 4.0 21.536000 58.079999 
30.400001 147 512.0 32.0 8.0 23.072001 59.840001 44.512000 148 512.0 64.0 1.0 22.464000 58.304001 21.136001 149 512.0 64.0 2.0 21.663999 32.256000 22.528000 150 512.0 64.0 4.0 23.424000 50.687999 28.112000 151 512.0 64.0 8.0 23.808001 63.519999 41.184001 152 512.0 128.0 1.0 23.040000 59.840001 31.968001 153 512.0 128.0 2.0 23.296000 65.888003 32.960001 154 512.0 128.0 4.0 23.647999 69.327995 35.872001 155 512.0 128.0 8.0 24.192000 69.343999 43.040000 156 512.0 256.0 1.0 26.848000 69.087997 66.111997 157 512.0 256.0 2.0 26.912000 70.319995 66.399999 158 512.0 256.0 4.0 28.928000 56.832001 68.159997 159 512.0 256.0 8.0 28.928000 65.952003 71.392000 160 1024.0 8.0 1.0 20.223999 73.151998 20.896001 161 1024.0 8.0 2.0 21.632001 69.087997 28.352000 162 1024.0 8.0 4.0 22.048000 113.696001 42.032000 163 1024.0 8.0 8.0 25.504000 206.880003 71.039997 164 1024.0 32.0 1.0 21.952000 63.648000 23.264000 165 1024.0 32.0 2.0 21.504000 66.512004 30.400001 166 1024.0 32.0 4.0 22.080000 63.167997 44.480000 167 1024.0 32.0 8.0 24.752000 166.495994 78.368001 168 1024.0 64.0 1.0 21.183999 69.728002 22.464000 169 1024.0 64.0 2.0 23.488000 57.376001 27.664000 170 1024.0 64.0 4.0 22.879999 57.744000 40.959999 171 1024.0 64.0 8.0 26.303999 70.464000 65.183997 172 1024.0 128.0 1.0 23.776000 68.095997 32.864001 173 1024.0 128.0 2.0 23.680000 62.112000 35.824001 174 1024.0 128.0 4.0 24.544001 57.599999 43.040000 175 1024.0 128.0 8.0 27.616000 67.071997 55.135999 176 1024.0 256.0 1.0 26.912000 55.408001 66.463999 177 1024.0 256.0 2.0 27.456000 68.448000 68.191998 178 1024.0 256.0 4.0 27.968001 60.768001 71.295999 179 1024.0 256.0 8.0 30.304000 62.080000 81.887998 180 2048.0 8.0 1.0 20.256000 69.151998 28.320000 181 2048.0 8.0 2.0 20.927999 113.920003 42.048000 182 2048.0 8.0 4.0 25.504000 207.039997 71.071997 183 2048.0 8.0 8.0 32.224000 378.208011 127.519995 184 2048.0 32.0 1.0 22.464000 63.519999 30.432001 185 2048.0 32.0 2.0 23.040000 62.272001 44.447999 186 2048.0 32.0 4.0 23.488000 72.512001 79.392001 187 2048.0 32.0 8.0 29.152000 119.616002 146.559998 188 2048.0 64.0 1.0 23.712000 59.424002 27.295999 189 2048.0 64.0 2.0 22.976000 64.640000 40.927999 190 2048.0 64.0 4.0 25.184000 57.472002 65.183997 191 2048.0 64.0 8.0 31.936001 75.903997 110.271998 192 2048.0 128.0 1.0 23.744000 65.952003 35.744000 193 2048.0 128.0 2.0 25.472000 53.984001 43.072000 194 2048.0 128.0 4.0 26.368000 58.848001 55.135999 195 2048.0 128.0 8.0 32.191999 62.080000 79.328001 196 2048.0 256.0 1.0 28.896000 48.416000 68.223998 197 2048.0 256.0 2.0 27.968001 66.111997 71.359999 198 2048.0 256.0 4.0 31.199999 59.872001 81.791997 199 2048.0 256.0 8.0 37.216000 66.111997 98.272003 200 4096.0 8.0 1.0 20.927999 113.920003 42.048000 201 4096.0 8.0 2.0 25.504000 206.624001 71.039997 202 4096.0 8.0 4.0 31.136001 378.224015 127.424002 203 4096.0 8.0 8.0 43.839999 729.439974 260.639995 204 4096.0 32.0 1.0 21.952000 60.112000 44.544000 205 4096.0 32.0 2.0 23.520000 72.255999 78.847997 206 4096.0 32.0 4.0 29.152000 119.616002 147.872001 207 4096.0 32.0 8.0 46.144001 211.807996 294.223994 208 4096.0 64.0 1.0 22.848001 59.551999 40.991999 209 4096.0 64.0 2.0 25.376000 58.688000 65.279998 210 4096.0 64.0 4.0 30.975999 75.935997 110.207997 211 4096.0 64.0 8.0 49.504001 122.432001 200.703993 212 4096.0 128.0 1.0 24.544001 71.392000 43.040000 213 4096.0 128.0 2.0 27.488001 48.799999 55.232000 214 4096.0 128.0 4.0 32.224000 60.192000 79.328001 215 4096.0 128.0 8.0 49.311999 83.839998 129.695997 216 4096.0 256.0 1.0 28.063999 60.192000 71.263999 217 4096.0 256.0 
2.0 30.272000 67.744002 81.791997 218 4096.0 256.0 4.0 36.928002 68.432003 98.240003 219 4096.0 256.0 8.0 49.472000 78.560002 132.175997 220 8192.0 8.0 1.0 26.880000 206.944004 71.039997 221 8192.0 8.0 2.0 32.095999 378.127992 127.568007 222 8192.0 8.0 4.0 43.968000 730.080009 260.320008 223 8192.0 8.0 8.0 69.408000 1435.008049 507.327974 224 8192.0 32.0 1.0 23.520000 72.832003 78.656003 225 8192.0 32.0 2.0 29.216001 119.935997 147.072002 226 8192.0 32.0 4.0 47.072001 212.495998 291.584015 227 8192.0 32.0 8.0 71.071997 386.880010 612.031996 228 8192.0 64.0 1.0 26.591999 53.760000 65.311998 229 8192.0 64.0 2.0 30.944001 76.063998 110.111997 230 8192.0 64.0 4.0 49.791999 122.528002 200.800002 231 8192.0 64.0 8.0 77.632003 215.792000 382.863998 232 8192.0 128.0 1.0 26.400000 60.031999 55.167999 233 8192.0 128.0 2.0 32.448001 60.640000 79.392001 234 8192.0 128.0 4.0 49.472000 83.871998 129.728004 235 8192.0 128.0 8.0 73.792003 131.487995 228.960007 236 8192.0 256.0 1.0 30.368000 62.928006 81.823997 237 8192.0 256.0 2.0 37.248001 65.792002 98.112002 238 8192.0 256.0 4.0 50.592002 78.688003 132.128000 239 8192.0 256.0 8.0 75.328000 105.439998 448.480010
pr: