cutlass 3.9 supported to improve fp8_blockwise_gemm #5820
Merged
Conversation
**elfiegg:** LGTM. QQ: what is the metric of the benchmark? I'm wondering if it's latency or throughput?

**Reply:** Hi @elfiegg, see `sglang/sgl-kernel/benchmark/bench_fp8_blockwise_gemm.py`, lines 115 to 146 (at 8d463fe). It is latency.

**zhyncs** approved these changes on Apr 29, 2025.
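The benchmark linked above reports per-kernel latency rather than throughput. As an illustrative sketch of the measurement pattern (this is not the actual `bench_fp8_blockwise_gemm.py` code, and the function name here is hypothetical), median latency over repeated runs can be taken like this; real GPU kernels additionally need device synchronization (e.g. CUDA events) around each timed call:

```python
import statistics
import time

def bench_latency_us(fn, warmup=5, iters=50):
    """Measure the median latency of `fn` in microseconds.

    Warmup runs are discarded so one-time costs (allocation,
    JIT compilation, autotuning) do not skew the timing, and the
    median is used so outlier iterations do not dominate.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)

# Example: time a CPU stand-in workload.
latency = bench_latency_us(lambda: sum(range(10_000)))
print(f"{latency:.1f} us")
```

With latency as the metric, lower numbers in the tables below are better at every batch size.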
Motivation
main:

Skip N=576, K=7168 for now.

deepseek-ai/DeepSeek-V3 — fp8 blockwise scaled matmul latency by batch size:

N=24576, K=7168:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   89.951999    93.024001      84.895998    76.544002
       8.0   81.216000    84.063999      70.303999    70.175998
      16.0   82.336001    85.023999      76.223999    64.640000
      32.0   80.991998    83.552003      64.000003    59.136000
      64.0   82.064003    84.991999      62.944002    57.760000
     128.0   77.087998    80.031998      97.952001    61.152000
     256.0  105.343997   107.391998     143.040001    87.296002
     512.0  199.167997   197.104007     271.295995   138.144001
    1024.0  399.904013   378.847986     537.728012   277.904004
    2048.0  800.607979   767.359972    1053.311944   556.544006
    4096.0 1619.567990  1522.143960    2200.223923  1127.392054
```

N=32768, K=512:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   16.416000    19.136000      15.232000    13.024000
       8.0   16.031999    19.136000      13.760000    12.768000
      16.0   16.384000    18.848000      13.344000    12.960000
      32.0   16.384000    18.751999      13.152000    13.088000
      64.0   16.160000    18.719999      13.376000    13.248000
     128.0   16.096000    18.751999      18.271999    13.824000
     256.0   21.312000    23.871999      28.255999    18.751999
     512.0   34.784000    36.256000      46.239998    28.767999
    1024.0   62.399998    61.439998      82.047999    46.944000
    2048.0  116.159998   110.799998     153.408006    85.919999
    4096.0  223.072007   207.519993     295.664012   165.184006
```

N=7168, K=16384:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   76.768003    78.960001     112.512000    60.224000
       8.0   76.031998    78.432001      72.480001    57.856001
      16.0   76.127999    78.624003      66.944003    57.087999
      32.0   76.608002    78.720003      70.303999    50.080001
      64.0   76.320000    79.039998      70.656002    41.887999
     128.0   76.320000    78.720003      82.847998    53.247999
     256.0   77.504002    80.416001     107.311994    58.143999
     512.0  149.471998   150.624007     200.895995   117.919996
    1024.0  289.696008   285.472006     471.136004   234.623998
    2048.0  528.591990   522.607982     755.904019   485.760003
    4096.0 1084.959984  1067.872047    1471.087933   733.471990
```

N=7168, K=18432:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   85.568003    88.096000     124.224000    65.215997
       8.0   84.384002    87.007999      83.807997    63.423999
      16.0   84.799998    86.176001      77.376001    61.567999
      32.0   84.384002    86.400002      76.800004    54.400001
      64.0   84.192000    86.592004      78.079998    45.759998
     128.0   84.063999    86.687997      94.176002    55.936001
     256.0   86.560003    89.120001     118.752003    66.143997
     512.0  166.207999   167.104006     220.912009   124.191999
    1024.0  348.863989   317.519993     514.559984   247.615993
    2048.0  601.472020   588.096023     827.552021   517.664015
    4096.0 1282.240033  1237.583995    1631.872058   824.751973
```

N=4608, K=7168:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   38.015999    41.407999      50.207999    29.023999
       8.0   37.856001    40.768001      33.824001    26.912000
      16.0   37.856001    40.768001      33.440001    25.312001
      32.0   37.951998    40.895998      33.216000    21.919999
      64.0   37.856001    40.959999      33.535998    19.808000
     128.0   38.079999    41.120000      40.927999    21.824000
     256.0   38.431998    41.536000      48.255999    26.752001
     512.0   69.600001    71.584001      81.408001    36.607999
    1024.0  101.375997   102.016002     135.263994    65.792002
    2048.0  164.287999   160.607994     215.903997   114.271998
    4096.0  299.392015   293.951988     409.103990   269.407988
```

N=3072, K=7168:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   37.567999    40.320002      41.855998    25.408000
       8.0   37.471998    40.256001      29.503999    25.312001
      16.0   37.503999    40.256001      29.279999    23.903999
      32.0   37.535999    40.383998      29.023999    19.711999
      64.0   37.471998    40.352002      29.216001    16.319999
     128.0   37.696000    40.608000      32.000002    17.824000
     256.0   37.728000    40.895998      40.128000    25.024001
     512.0   38.304001    41.216001      48.928000    28.672000
    1024.0   69.215998    71.392000      91.392003    41.664001
    2048.0  101.807997   102.784000     136.255994    76.959997
    4096.0  196.544006   198.016003     271.488011   132.927999
```

N=4096, K=512:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   10.272000    13.376000       9.312000        8.192
       8.0   10.240000    13.248000       8.480000        7.936
      16.0   10.272000    13.280000       8.288000        7.808
      32.0   10.432000    13.280000       8.288000        8.000
      64.0   10.272000    13.280000       8.160000        8.000
     128.0   10.464000    13.296000       8.608000        8.032
     256.0   10.384000    13.312000       9.504000        8.608
     512.0   10.624000    13.504000      10.624000        9.792
    1024.0   13.760000    16.287999      16.368000       12.128
    2048.0   20.927999    23.040000      26.016001       17.600
    4096.0   34.784000    36.384001      44.480000       27.424
```

N=3072, K=1536:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   14.528000    17.247999      14.336000    10.336000
       8.0   14.656000    17.247999      12.032000    10.016000
      16.0   14.560000    17.247999      11.584000     9.872000
      32.0   14.528000    17.279999      11.744000     9.376000
      64.0   14.592000    17.312000      11.680000     9.376000
     128.0   14.656000    17.376000      12.320000     9.664000
     256.0   14.816000    17.472001      13.792000    10.464000
     512.0   14.976000    17.535999      16.480001    11.840000
    1024.0   22.496000    24.831999      27.680000    16.287999
    2048.0   30.463999    32.127999      38.015999    23.808001
    4096.0   53.024001    54.912001      68.672001    37.632000
```

N=512, K=7168:
```
batch_size        vllm   sgl-kernel  sglang triton     deepgemm
       1.0   20.608000    31.872001      31.615999    13.632000
       8.0   20.608000    31.904001      26.912000    13.248000
      16.0   20.703999    31.904001      26.944000    12.896000
      32.0   20.768000    31.840000      26.784001    12.864000
      64.0   20.752000    31.679999      26.912000    12.928000
     128.0   20.736000    32.000002      26.784001    12.896000
     256.0   21.056000    32.320000      26.784001    13.120000
     512.0   21.792000    32.703999      27.295999    15.456000
    1024.0   26.720000    31.840000      30.719999    16.448000
    2048.0   33.535998    36.031999      39.551999    21.888001
    4096.0   46.656001    44.000000      50.719999    29.983999
```

N=7168, K=2304: (results truncated in the source)
18.912001 21.919999 22.399999 14.528000 1 8.0 18.464001 21.376001 17.824000 14.176000 2 16.0 18.592000 21.504000 17.344000 14.048000 3 32.0 18.528000 21.407999 17.152000 13.696000 4 64.0 18.464001 21.504000 17.088000 13.312000 5 128.0 18.560000 21.632001 18.848000 14.720000 6 256.0 18.688001 21.792000 22.431999 16.416000 7 512.0 29.279999 31.583998 36.800001 24.224000 8 1024.0 50.624002 50.976001 77.312000 39.584000 9 2048.0 82.911998 80.991998 111.040004 71.199998 10 4096.0 158.976004 152.224004 218.096003 108.319998 deepseek-ai/DeepSeek-V3 N=7168 K=2048: fp8 blockwise scaled matmul: batch_size vllm sgl-kernel sglang triton deepgemm 0 1.0 17.696001 20.703999 21.760000 12.704000 1 8.0 17.216001 20.191999 14.816000 12.768000 2 16.0 17.600000 20.479999 14.624000 12.320000 3 32.0 17.408000 20.191999 14.592000 11.712000 4 64.0 17.376000 20.288000 14.688000 11.744000 5 128.0 17.535999 20.400001 16.543999 12.800000 6 256.0 17.535999 20.544000 20.096000 14.624000 7 512.0 26.912000 29.440001 33.087999 22.848001 8 1024.0 45.919999 46.688002 70.015997 38.240001 9 2048.0 75.456001 73.536001 101.888001 68.063997 10 4096.0 144.191995 137.503996 193.376005 102.080002 deepseek-ai/DeepSeek-V3 N=7168 K=256: fp8 blockwise scaled matmul: batch_size vllm sgl-kernel sglang triton deepgemm 0 1.0 9.440000 12.512000 8.736000 8.448000 1 8.0 9.408000 12.448000 7.616000 8.000000 2 16.0 9.408000 12.416000 7.584000 8.192000 3 32.0 9.504000 12.320000 7.584000 8.160000 4 64.0 9.504000 12.352000 7.904000 8.192000 5 128.0 9.408000 12.224000 8.480000 8.480000 6 256.0 9.760000 12.544000 9.568000 9.056000 7 512.0 11.904000 14.496000 12.992000 11.424000 8 1024.0 16.352000 18.848000 23.232000 15.104000 9 2048.0 23.808001 25.599999 34.272000 22.752000 10 4096.0 40.832002 41.824002 61.471999 31.615999 Benchmark finished!
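For context on what "fp8 blockwise scaled matmul" is timing: the weight is quantized per 128x128 tile with one scale per tile, and the GEMM has to apply those scales during accumulation. Below is a minimal NumPy sketch of that scaling scheme, assuming a weight-only, tile-exact view; the function names are mine, and the actual cast to float8 (with its rounding error) is only simulated by rescaling.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3


def quantize_blockwise(w, block=128):
    """Compute one scale per (block x block) tile of a 2-D weight and
    rescale each tile so its max magnitude maps into the fp8 range.
    The cast to float8 itself (and its rounding) is omitted in this sketch."""
    n, k = w.shape
    scales = np.empty((n // block, k // block))
    q = np.empty_like(w)
    for i in range(0, n, block):
        for j in range(0, k, block):
            tile = w[i:i + block, j:j + block]
            s = np.abs(tile).max() / FP8_E4M3_MAX
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = tile / s
    return q, scales


def blockwise_scaled_matmul(a, b_q, b_scales, block=128):
    """Reference GEMM: dequantize each weight tile with its scale, then
    matmul. The CUTLASS kernels fuse this rescale into the fp8 GEMM
    instead of materializing the dequantized weight."""
    n, k = b_q.shape
    b_deq = np.empty_like(b_q)
    for i in range(0, n, block):
        for j in range(0, k, block):
            b_deq[i:i + block, j:j + block] = (
                b_q[i:i + block, j:j + block] * b_scales[i // block, j // block]
            )
    return a @ b_deq.T
```

Activations get an analogous per-group (1x128) scale in the real kernels; this sketch only shows the weight side, which is where the 128x128 tiling in the benchmarked shapes comes from.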
pr (this branch, built against CUTLASS 3.9):
```
DeepSeek-V3 tp=8
INFO 04-28 06:38:20 [__init__.py:239] Automatically detected platform cuda.
Skip N=576, K=7168 now

deepseek-ai/DeepSeek-V3 N=24576 K=7168: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      89.887999    92.224002    85.184000    76.416001
1   8.0      81.216000    83.456002    70.239998    70.592001
2   16.0     82.336001    84.608003    76.352000    65.056004
3   32.0     81.055999    83.071999    64.127997    59.583999
4   64.0     82.479998    84.352002    63.327998    57.952002
5   128.0    77.632003    79.935998    98.240003    61.360002
6   256.0    105.503999   106.112003   143.616006   86.687997
7   512.0    201.072007   197.392002   272.015989   136.656001
8   1024.0   395.135999   386.496007   536.880016   272.992015
9   2048.0   793.167949   762.272000   1054.816008  566.496015
10  4096.0   1594.303966  1581.055999  2187.151909  1143.615961

deepseek-ai/DeepSeek-V3 N=32768 K=512: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      16.480001    19.328000    15.840000    13.024000
1   8.0      16.128000    19.424001    13.696000    12.928000
2   16.0     15.968001    19.136000    13.536000    13.184000
3   32.0     16.224001    19.168001    13.280000    13.056000
4   64.0     16.000001    19.040000    13.728000    13.200000
5   128.0    16.128000    19.231999    18.400000    14.176000
6   256.0    21.536000    23.968000    28.287999    19.168001
7   512.0    34.880001    35.999998    46.271998    29.023999
8   1024.0   62.368002    60.256001    82.751997    47.488000
9   2048.0   117.215998   107.808001   155.328006   85.727997
10  4096.0   223.616004   199.647993   295.520008   165.408000

deepseek-ai/DeepSeek-V3 N=7168 K=16384: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      77.183999    76.991998    112.896003   60.384002
1   8.0      75.935997    76.640002    72.800003    58.240000
2   16.0     76.448001    76.672003    67.167997    57.023998
3   32.0     76.768003    76.831996    70.656002    50.239999
4   64.0     76.448001    77.151999    71.071997    41.855998
5   128.0    75.712003    76.448001    82.815997    53.376000
6   256.0    77.791996    78.272000    108.255997   58.432002
7   512.0    149.215996   147.264004   203.040004   118.207999
8   1024.0   289.664000   285.135984   474.687994   230.463997
9   2048.0   536.095977   511.615992   710.655987   466.239989
10  4096.0   1084.720016  1057.407975  1456.592083  710.687995

deepseek-ai/DeepSeek-V3 N=7168 K=18432: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      85.759997    85.663997    124.767996   65.600000
1   8.0      84.608003    85.023999    83.743997    63.840002
2   16.0     84.895998    84.927998    77.568002    61.535999
3   32.0     85.440002    85.120000    77.087998    54.623999
4   64.0     84.863998    84.991999    78.272000    46.176001
5   128.0    84.959999    85.199997    95.168002    55.872001
6   256.0    86.528003    87.583996    118.592001   66.367999
7   512.0    167.104006   164.031997   221.103996   124.512002
8   1024.0   329.216003   317.535996   513.599992   255.488008
9   2048.0   593.631983   577.744007   799.504042   493.696004
10  4096.0   1387.359977  1252.544045  1652.799964  782.815993

deepseek-ai/DeepSeek-V3 N=4608 K=7168: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      38.304001    40.031999    50.687999    28.928000
1   8.0      38.112000    39.744001    33.856001    26.815999
2   16.0     37.919998    39.744001    33.440001    25.728000
3   32.0     37.888002    39.840002    33.599999    22.112001
4   64.0     38.015999    39.840002    33.696000    20.000000
5   128.0    38.368002    40.031999    41.343998    22.016000
6   256.0    38.431998    40.640000    48.160002    26.880000
7   512.0    69.664001    69.343999    81.696004    36.543999
8   1024.0   101.760000   99.840000    135.008007   65.632001
9   2048.0   164.287999   161.200002   216.064006   114.784002
10  4096.0   298.752010   294.528008   406.816006   273.056000

deepseek-ai/DeepSeek-V3 N=3072 K=7168: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      37.664000    39.584000    42.048000    25.615999
1   8.0      37.439998    39.296001    29.696001    25.248000
2   16.0     37.728000    39.296001    29.503999    23.776000
3   32.0     37.503999    39.328001    29.408000    20.128001
4   64.0     37.567999    39.360002    29.600000    16.448000
5   128.0    37.983999    39.551999    32.032002    18.208001
6   256.0    38.112000    40.064000    40.320002    25.024001
7   512.0    38.527999    40.544000    48.928000    28.640000
8   1024.0   69.728002    70.271999    92.416003    41.664001
9   2048.0   102.112003   102.527998   136.575997   76.704003
10  4096.0   195.968002   193.599999   269.760013   133.184001

deepseek-ai/DeepSeek-V3 N=4096 K=512: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      10.432000    13.728000    9.696000     8.160000
1   8.0      10.560000    13.696000    8.704000     8.096000
2   16.0     10.432000    13.728000    8.544000     7.936000
3   32.0     10.656000    13.760000    8.480000     7.936000
4   64.0     10.720000    13.792000    8.352000     8.192000
5   128.0    10.464000    13.824000    8.800000     8.416000
6   256.0    10.528000    13.696000    9.696000     8.960000
7   512.0    10.912000    13.856000    11.104000    9.952000
8   1024.0   13.984000    16.767999    15.936000    12.256000
9   2048.0   21.215999    23.232000    25.888000    17.856000
10  4096.0   35.168000    35.872001    44.767998    27.712001

deepseek-ai/DeepSeek-V3 N=3072 K=1536: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      14.752000    17.472001    14.368000    10.544000
1   8.0      14.912000    17.472001    12.000000    10.432000
2   16.0     14.912000    17.472001    11.776000    10.112000
3   32.0     14.944000    17.503999    11.936000    9.536000
4   64.0     14.816000    17.519999    11.840000    9.600000
5   128.0    14.880000    17.600000    12.704000    9.664000
6   256.0    14.976000    17.728001    14.112000    10.400000
7   512.0    15.168000    17.792000    16.608000    11.968000
8   1024.0   22.720000    24.896000    27.840000    16.319999
9   2048.0   30.624000    32.352000    38.240001    24.224000
10  4096.0   53.952001    53.856000    69.343999    37.664000

deepseek-ai/DeepSeek-V3 N=512 K=7168: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      20.927999    31.599998    31.840000    13.824000
1   8.0      20.927999    31.615999    27.104000    13.472000
2   16.0     20.959999    31.583998    27.168000    13.248000
3   32.0     20.768000    31.552002    27.200000    13.184000
4   64.0     20.864001    31.328000    27.071999    13.152000
5   128.0    20.768000    31.711999    26.815999    13.056000
6   256.0    21.088000    32.000002    27.008001    13.472000
7   512.0    22.112001    32.288000    27.327999    16.096000
8   1024.0   26.848000    30.975999    30.880000    16.256001
9   2048.0   33.376001    34.432001    39.519999    21.663999
10  4096.0   46.751998    41.983999    50.912000    30.304000

deepseek-ai/DeepSeek-V3 N=7168 K=2304: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      19.200001    21.919999    22.399999    14.720000
1   8.0      18.448001    21.312000    17.824000    14.336000
2   16.0     18.624000    21.472000    17.535999    14.464000
3   32.0     18.560000    21.344000    17.152000    13.728000
4   64.0     18.719999    21.472000    17.472001    13.504000
5   128.0    18.688001    21.568000    19.040000    14.912000
6   256.0    19.072000    21.856001    22.431999    15.936000
7   512.0    29.536000    31.520002    37.280001    24.480000
8   1024.0   51.040001    51.584002    78.111999    39.935999
9   2048.0   83.296001    82.336001    111.040004   71.616001
10  4096.0   159.199998   155.103996   216.495991   109.151997

deepseek-ai/DeepSeek-V3 N=7168 K=2048: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      18.144000    20.768000    21.663999    13.056000
1   8.0      17.632000    20.288000    15.264000    12.832000
2   16.0     17.888000    20.512000    15.072000    12.704000
3   32.0     17.664000    20.384001    14.848000    11.872000
4   64.0     17.568000    20.416001    15.008000    11.936000
5   128.0    17.759999    20.479999    16.896000    12.960000
6   256.0    17.888000    20.640001    20.320000    14.848000
7   512.0    27.264001    29.376000    33.312000    23.391999
8   1024.0   46.335999    47.488000    70.688002    38.400002
9   2048.0   75.552002    74.816003    102.176003   68.351999
10  4096.0   144.127995   141.072005   193.248004   102.240004

deepseek-ai/DeepSeek-V3 N=7168 K=256: fp8 blockwise scaled matmul:
    batch_size   vllm   sgl-kernel   sglang triton   deepgemm
0   1.0      9.568000     12.960000    8.928000     8.608000
1   8.0      9.792000     12.928000    7.776000     8.416000
2   16.0     9.536000     12.928000    8.000000     8.416000
3   32.0     9.504000     12.800000    8.000000     8.160000
4   64.0     9.504000     12.800000    8.064000     8.160000
5   128.0    9.632000     12.768000    8.704000     8.672000
6   256.0    9.568000     12.992000    9.888000     9.280000
7   512.0    11.936000    14.912000    13.184000    11.520000
8   1024.0   16.480001    19.136000    23.552001    15.456000
9   2048.0   24.032000    25.728000    34.655999    22.720000
10  4096.0   40.959999    40.895998    61.503999    31.552002

Benchmark finished!
```
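A convenient way to read the two dumps is to compare a single column between runs. A small sketch, using the batch-4096 `sgl-kernel` latencies copied from the tables above (the dict keys are just labels I chose; units are whatever the benchmark script reports, latency per the discussion in this thread):

```python
# batch_size=4096 sgl-kernel latencies from the two benchmark dumps above.
main_run = {
    "N=32768 K=512": 207.519993,
    "N=7168 K=16384": 1067.872047,
    "N=7168 K=18432": 1237.583995,
}
pr_run = {
    "N=32768 K=512": 199.647993,
    "N=7168 K=16384": 1057.407975,
    "N=7168 K=18432": 1252.544045,
}

# Ratio > 1 means the PR run was faster for that shape.
speedup = {shape: main_run[shape] / pr_run[shape] for shape in main_run}
for shape, s in speedup.items():
    print(f"{shape}: {s:.3f}x")
```

As the ratios show, the effect of the CUTLASS 3.9 upgrade varies with shape and batch size; some shapes improve while others are roughly flat or slightly slower in this particular run.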
## Modifications

## Checklist