reduce moe_align_block_size_kernel small batch mode overhead #5086
Conversation
cool
@BBuf Can you successfully use the tuning script after this change?
@zhyncs @merrymercy Now I have fixed all the bugs and performance issues in sgl_kernel.
num_tokens num_experts topk SGL Triton VLLM
0 16384.0 8.0 1.0 27.744001 377.983987 126.864001
1 16384.0 8.0 2.0 36.991999 733.215988 259.519994
2 16384.0 8.0 4.0 55.583999 1440.768003 508.512020
3 16384.0 8.0 8.0 92.384003 2859.999895 1001.136065
4 16384.0 32.0 1.0 28.767999 119.584002 146.592006
5 16384.0 32.0 2.0 38.943999 212.543994 295.103997
6 16384.0 32.0 4.0 58.304001 385.760009 608.287990
7 16384.0 32.0 8.0 98.463997 745.248020 1175.904036
8 16384.0 64.0 1.0 30.944001 76.448001 109.856002
9 16384.0 64.0 2.0 41.760001 123.744003 200.320005
10 16384.0 64.0 4.0 63.584000 216.831997 382.048011
11 16384.0 64.0 8.0 107.199997 390.175998 755.392015
12 16384.0 128.0 1.0 30.912001 59.136000 79.296000
13 16384.0 128.0 2.0 40.383998 83.360001 129.728004
14 16384.0 128.0 4.0 58.559999 129.951999 228.271991
15 16384.0 128.0 8.0 96.032001 223.072007 426.912010
16 16384.0 256.0 1.0 35.392001 67.327999 98.463997
17 16384.0 256.0 2.0 41.439999 80.192000 131.871998
18 16384.0 256.0 4.0 57.760000 102.816001 448.000014
19 16384.0 256.0 8.0 90.559997 152.288005 762.592018
20 32768.0 8.0 1.0 37.216000 738.048017 259.552002
21 32768.0 8.0 2.0 55.536002 1451.167941 507.488012
22 32768.0 8.0 4.0 92.896000 2860.960007 1002.847910
23 32768.0 8.0 8.0 169.760004 5724.895954 1990.640044
24 32768.0 32.0 1.0 39.039999 212.543994 293.888003
25 32768.0 32.0 2.0 58.688000 386.511981 610.943973
26 32768.0 32.0 4.0 98.623998 745.696008 1178.751945
27 32768.0 32.0 8.0 180.319995 1468.608022 2393.248081
28 32768.0 64.0 1.0 42.048000 123.039998 200.703993
29 32768.0 64.0 2.0 63.231997 216.352001 382.432014
30 32768.0 64.0 4.0 107.648000 390.464008 751.215994
31 32768.0 64.0 8.0 199.072003 752.991974 1491.999984
32 32768.0 128.0 1.0 40.320002 83.376005 129.567996
33 32768.0 128.0 2.0 58.816001 130.991995 228.863999
34 32768.0 128.0 4.0 96.064001 223.744005 427.583992
35 32768.0 128.0 8.0 172.992006 399.744004 844.015956
36 32768.0 256.0 1.0 41.471999 78.783996 131.935999
37 32768.0 256.0 2.0 57.696000 103.200004 449.088007
38 32768.0 256.0 4.0 90.655997 152.383998 763.616025
39 32768.0 256.0 8.0 159.904003 249.952003 1397.055984
40 65536.0 8.0 1.0 55.520002 1451.712012 507.856011
41 65536.0 8.0 2.0 92.799999 2862.272024 1002.287984
42 65536.0 8.0 4.0 168.799996 5728.703976 1989.536047
43 65536.0 8.0 8.0 315.903991 11395.999908 3962.704182
44 65536.0 32.0 1.0 58.816001 386.335999 609.535992
45 65536.0 32.0 2.0 98.495997 746.240020 1169.664025
46 65536.0 32.0 4.0 181.088001 1468.719959 2406.896114
47 65536.0 32.0 8.0 336.800009 2901.887894 4866.208076
48 65536.0 64.0 1.0 63.231997 216.639996 382.016003
49 65536.0 64.0 2.0 107.327998 390.560001 752.560019
50 65536.0 64.0 4.0 198.080003 752.416015 1490.944028
51 65536.0 64.0 8.0 372.415990 1472.352028 2908.512115
52 65536.0 128.0 1.0 58.975998 130.559996 228.720009
53 65536.0 128.0 2.0 96.064001 223.583996 428.351998
54 65536.0 128.0 4.0 174.464002 399.792016 842.736006
55 65536.0 128.0 8.0 322.880000 761.391997 1638.512015
56 65536.0 256.0 1.0 57.728000 103.391998 447.551996
57 65536.0 256.0 2.0 90.767995 152.160004 763.599992
58 65536.0 256.0 4.0 158.672005 249.952003 1397.632003
59 65536.0 256.0 8.0 289.600015 429.791987 2670.127869
60 131072.0 8.0 1.0 92.735998 2864.223957 1001.824021
61 131072.0 8.0 2.0 168.768004 5724.736214 1993.952036
62 131072.0 8.0 4.0 315.744013 11396.672249 3963.071823
63 131072.0 8.0 8.0 606.112003 22729.358673 7922.160149
64 131072.0 32.0 1.0 98.527998 746.016026 1176.031947
65 131072.0 32.0 2.0 181.024000 1468.736053 2392.623901
66 131072.0 32.0 4.0 336.479992 2903.136015 4890.624046
67 131072.0 32.0 8.0 647.647977 5768.127918 9744.992256
68 131072.0 64.0 1.0 107.423998 390.464008 752.864003
69 131072.0 64.0 2.0 198.783994 752.255976 1490.175962
70 131072.0 64.0 4.0 372.447997 1472.336054 2910.624027
71 131072.0 64.0 8.0 717.664003 2899.647951 5952.688217
72 131072.0 128.0 1.0 96.256003 224.127993 427.648008
73 131072.0 128.0 2.0 173.439994 399.776012 843.312025
74 131072.0 128.0 4.0 322.784007 761.744022 1638.623953
75 131072.0 128.0 8.0 617.088020 1481.727958 3243.520021
76 131072.0 256.0 1.0 90.751998 152.416006 763.360023
77 131072.0 256.0 2.0 159.040004 249.727994 1398.591995
78 131072.0 256.0 4.0 289.503992 429.536015 2671.008110
79 131072.0 256.0 8.0 548.527956 797.551990 5192.447662

Now the kernel outperforms Triton and VLLM for all combinations of num_tokens, num_experts, and topk. Previously, it was slower than Triton when the token count was large (e.g. >= 65536) because a counting loop had not been converted to a strided loop, which resulted in non-contiguous memory access. cc @yiakwy-xpu-ml-framework-team
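For reference, a minimal sketch of the strided counting pattern described above, with illustrative names (`topk_ids`, `expert_counts`) rather than the exact sgl-kernel code:

```cuda
#include <cstdint>

// Grid-stride counting loop: on every iteration, consecutive threads read
// consecutive elements of topk_ids, so global loads coalesce even when the
// flattened size (num_tokens * topk) is much larger than the grid.
__global__ void count_tokens_per_expert(const int32_t* __restrict__ topk_ids,
                                        int32_t* __restrict__ expert_counts,
                                        size_t numel) {
  const size_t start  = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  const size_t stride = static_cast<size_t>(blockDim.x) * gridDim.x;
  for (size_t i = start; i < numel; i += stride) {
    atomicAdd(&expert_counts[topk_ids[i]], 1);
  }
}
```

By contrast, giving each thread its own contiguous chunk (thread t handling [t*chunk, (t+1)*chunk)) makes neighboring threads read addresses that are a whole chunk apart on each step, so the loads cannot coalesce and throughput collapses at large token counts.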
Motivation
Fix the moe_align_block_size kernel benchmark bug: the cumsum_buffer allocation is changed from torch.zeros to torch.empty in fused_moe_triton.py. For the small-token mode, I wrote a new simple kernel to handle it; it does not need torch.zeros to initialize the cumsum buffer, since everything happens in shared memory (a sketch follows below).
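A minimal sketch of that small-batch idea, assuming a single-block launch and illustrative names (this is not the exact kernel added by this PR): the per-expert counts and the padded cumulative sum are zero-initialized and built entirely in shared memory, which is why the global cumsum buffer no longer has to come from torch.zeros.

```cuda
#include <cstdint>

// Hedged sketch of a single-block "small batch" moe_align_block_size-style
// kernel. The real kernel's outputs and names may differ; the point is that
// the counters and the cumulative sum never rely on the global buffer being
// pre-zeroed, so it can be allocated with torch.empty.
template <int MAX_EXPERTS>
__global__ void moe_align_small_batch_sketch(
    const int32_t* __restrict__ topk_ids,   // flattened [num_tokens * topk]
    int32_t numel,                          // num_tokens * topk
    int32_t num_experts,                    // <= MAX_EXPERTS
    int32_t block_size,                     // MoE block size used for padding
    int32_t* __restrict__ cumsum_out,       // [num_experts + 1], torch.empty is fine
    int32_t* __restrict__ total_padded) {   // scalar output
  __shared__ int32_t smem_cumsum[MAX_EXPERTS + 1];

  // Zero-init in shared memory replaces the host-side torch.zeros.
  for (int e = threadIdx.x; e <= num_experts; e += blockDim.x) {
    smem_cumsum[e] = 0;
  }
  __syncthreads();

  // Count tokens per expert; counts are staged at index (expert + 1).
  for (int i = threadIdx.x; i < numel; i += blockDim.x) {
    atomicAdd(&smem_cumsum[topk_ids[i] + 1], 1);
  }
  __syncthreads();

  // Sequential scan with per-expert padding; cheap for the small token and
  // expert counts this mode targets.
  if (threadIdx.x == 0) {
    for (int e = 0; e < num_experts; ++e) {
      const int32_t padded =
          (smem_cumsum[e + 1] + block_size - 1) / block_size * block_size;
      smem_cumsum[e + 1] = smem_cumsum[e] + padded;
    }
    *total_padded = smem_cumsum[num_experts];
  }
  __syncthreads();

  // Write results out; the destination never needed zero-initialization.
  for (int e = threadIdx.x; e <= num_experts; e += blockDim.x) {
    cumsum_out[e] = smem_cumsum[e];
  }
}
```

A launch in this mode would use a single block, e.g. `moe_align_small_batch_sketch<256><<<1, 256>>>(...)`, with the output tensors allocated via torch.empty on the Python side.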
Acc test

I set token_cnts_buffer and cumsum_buffer to torch.empty in fused_moe.py:

Acc result:
Kernel unit-test
Benchmark on H200
main branch:
📊 Running performance benchmark for 8 experts... moe-align-block-size-performance: num_tokens num_experts topk SGL Triton VLLM 0 1.0 8.0 1.0 18.975999 67.359999 14.624000 1 1.0 8.0 2.0 19.136000 23.264000 14.607999 2 1.0 8.0 4.0 20.384001 63.519999 14.624000 3 1.0 8.0 8.0 19.424001 62.368002 14.656000 4 1.0 32.0 1.0 20.128001 62.912002 16.640000 5 1.0 32.0 2.0 21.536000 63.263997 16.640000 6 1.0 32.0 4.0 21.632001 69.023997 16.576000 7 1.0 32.0 8.0 20.256000 59.712000 16.640000 8 1.0 64.0 1.0 22.816001 56.912001 20.128001 9 1.0 64.0 2.0 22.816001 28.672000 20.256000 10 1.0 64.0 4.0 22.784000 69.440000 20.223999 11 1.0 64.0 8.0 22.816001 65.024003 20.288000 12 1.0 128.0 1.0 24.224000 64.511999 30.975999 13 1.0 128.0 2.0 23.040000 68.335995 30.944001 14 1.0 128.0 4.0 24.288001 63.167997 30.944001 15 1.0 128.0 8.0 23.135999 63.808002 31.040000 16 1.0 256.0 1.0 26.559999 58.240000 65.471999 17 1.0 256.0 2.0 26.591999 70.528001 65.471999 18 1.0 256.0 4.0 26.815999 61.471999 65.632001 19 1.0 256.0 8.0 27.872000 59.680000 65.664001 20 8.0 8.0 1.0 19.200001 65.568000 14.688000 21 8.0 8.0 2.0 19.455999 60.447998 14.688000 22 8.0 8.0 4.0 20.320000 69.552004 14.544000 23 8.0 8.0 8.0 20.447999 68.736002 14.720000 24 8.0 32.0 1.0 21.600001 60.208000 16.640000 25 8.0 32.0 2.0 21.663999 59.999999 16.608000 26 8.0 32.0 4.0 20.576000 62.399998 16.576000 27 8.0 32.0 8.0 21.632001 60.095999 16.672000 28 8.0 64.0 1.0 21.376001 27.264001 20.223999 29 8.0 64.0 2.0 21.376001 60.112000 20.320000 30 8.0 64.0 4.0 21.344000 62.816001 20.223999 31 8.0 64.0 8.0 22.911999 27.360000 20.320000 32 8.0 128.0 1.0 24.288001 57.535999 30.975999 33 8.0 128.0 2.0 24.256000 57.856001 31.104000 34 8.0 128.0 4.0 23.135999 64.544000 31.104000 35 8.0 128.0 8.0 24.224000 57.392001 31.136001 36 8.0 256.0 1.0 27.872000 69.199994 65.632001 37 8.0 256.0 2.0 26.591999 63.040003 65.632001 38 8.0 256.0 4.0 27.872000 55.744000 65.664001 39 8.0 256.0 8.0 27.872000 69.888003 65.696001 40 16.0 8.0 1.0 20.352000 58.944002 14.592000 41 16.0 8.0 2.0 20.352000 53.440001 14.560000 42 16.0 8.0 4.0 19.200001 55.904001 14.720000 43 16.0 8.0 8.0 20.416001 71.744002 14.944000 44 16.0 32.0 1.0 21.600001 65.295994 16.640000 45 16.0 32.0 2.0 20.064000 66.111997 16.576000 46 16.0 32.0 4.0 20.160001 58.543999 16.704001 47 16.0 32.0 8.0 21.600001 156.992003 16.960001 48 16.0 64.0 1.0 21.376001 63.135996 20.288000 49 16.0 64.0 2.0 22.752000 61.792001 20.256000 50 16.0 64.0 4.0 21.344000 63.792005 20.320000 51 16.0 64.0 8.0 21.824000 57.535999 20.416001 52 16.0 128.0 1.0 23.040000 70.175998 31.120000 53 16.0 128.0 2.0 24.224000 63.936003 31.040000 54 16.0 128.0 4.0 23.135999 68.640001 31.168001 55 16.0 128.0 8.0 23.232000 57.119999 31.231999 56 16.0 256.0 1.0 27.904000 66.239998 65.696001 57 16.0 256.0 2.0 26.976001 68.191998 65.664001 58 16.0 256.0 4.0 26.815999 57.280000 65.664001 59 16.0 256.0 8.0 27.968001 62.688001 65.888003 60 32.0 8.0 1.0 20.320000 68.127997 14.560000 61 32.0 8.0 2.0 20.320000 63.167997 14.720000 62 32.0 8.0 4.0 20.352000 56.063998 14.976000 63 32.0 8.0 8.0 19.040000 68.223998 15.552000 64 32.0 32.0 1.0 20.064000 61.280001 16.608000 65 32.0 32.0 2.0 21.663999 59.776001 16.704001 66 32.0 32.0 4.0 21.504000 55.968001 16.992001 67 32.0 32.0 8.0 20.160001 64.000003 17.344000 68 32.0 64.0 1.0 21.824000 61.951999 20.223999 69 32.0 64.0 2.0 21.856001 60.896002 20.320000 70 32.0 64.0 4.0 21.856001 54.048002 20.416001 71 32.0 64.0 8.0 21.632001 72.704002 20.864001 72 32.0 128.0 1.0 23.072001 65.888003 31.136001 73 32.0 128.0 2.0 22.944000 
60.063999 31.104000 74 32.0 128.0 4.0 23.072001 70.335999 31.199999 75 32.0 128.0 8.0 23.391999 60.288001 31.328000 76 32.0 256.0 1.0 27.872000 67.456000 65.728001 77 32.0 256.0 2.0 27.775999 67.071997 65.664001 78 32.0 256.0 4.0 27.872000 61.567999 65.792002 79 32.0 256.0 8.0 26.815999 69.023997 65.920003 80 64.0 8.0 1.0 20.479999 58.272000 14.720000 81 64.0 8.0 2.0 20.576000 65.087996 14.944000 82 64.0 8.0 4.0 20.064000 65.375999 15.552000 83 64.0 8.0 8.0 19.424001 72.959997 16.960001 84 64.0 32.0 1.0 21.600001 68.624005 16.736001 85 64.0 32.0 2.0 21.695999 67.376003 16.928000 86 64.0 32.0 4.0 21.504000 64.095996 17.344000 87 64.0 32.0 8.0 21.856001 64.159997 19.200001 88 64.0 64.0 1.0 21.600001 70.367999 20.320000 89 64.0 64.0 2.0 22.848001 68.847999 20.447999 90 64.0 64.0 4.0 21.952000 62.912002 20.800000 91 64.0 64.0 8.0 22.368001 65.728001 21.056000 92 64.0 128.0 1.0 23.167999 75.471997 31.136001 93 64.0 128.0 2.0 23.232000 32.575998 31.296000 94 64.0 128.0 4.0 23.391999 67.727998 31.296000 95 64.0 128.0 8.0 24.383999 60.672000 31.968001 96 64.0 256.0 1.0 26.591999 74.047998 65.728001 97 64.0 256.0 2.0 27.872000 66.463999 65.888003 98 64.0 256.0 4.0 28.063999 64.095996 65.888003 99 64.0 256.0 8.0 26.591999 69.824003 66.111997 100 128.0 8.0 1.0 19.168001 68.896003 14.944000 101 128.0 8.0 2.0 20.096000 60.768001 15.568000 102 128.0 8.0 4.0 20.608000 67.039996 16.928000 103 128.0 8.0 8.0 20.191999 69.632001 20.864001 104 128.0 32.0 1.0 21.536000 67.135997 16.960001 105 128.0 32.0 2.0 21.504000 68.000004 17.344000 106 128.0 32.0 4.0 22.016000 64.223997 19.231999 107 128.0 32.0 8.0 21.888001 68.672001 23.296000 108 128.0 64.0 1.0 22.879999 73.504001 20.416001 109 128.0 64.0 2.0 21.663999 59.712000 20.832000 110 128.0 64.0 4.0 21.728000 66.624001 21.056000 111 128.0 64.0 8.0 21.280000 68.832003 22.560000 112 128.0 128.0 1.0 23.296000 60.095999 31.231999 113 128.0 128.0 2.0 22.911999 70.015997 31.296000 114 128.0 128.0 4.0 24.416000 70.464000 32.032002 115 128.0 128.0 8.0 23.328001 71.327999 32.960001 116 128.0 256.0 1.0 27.039999 216.959998 65.888003 117 128.0 256.0 2.0 28.031999 65.743998 65.856002 118 128.0 256.0 4.0 27.872000 69.632001 66.207998 119 128.0 256.0 8.0 27.008001 68.688005 66.463999 120 256.0 8.0 1.0 20.128001 58.143999 15.536000 121 256.0 8.0 2.0 20.671999 63.231997 16.960001 122 256.0 8.0 4.0 20.223999 65.600000 20.896001 123 256.0 8.0 8.0 21.695999 71.744002 28.287999 124 256.0 32.0 1.0 21.632001 59.103999 17.344000 125 256.0 32.0 2.0 22.016000 29.056000 19.231999 126 256.0 32.0 4.0 20.927999 68.159997 23.264000 127 256.0 32.0 8.0 22.336001 67.520000 30.432001 128 256.0 64.0 1.0 21.632001 29.888000 20.864001 129 256.0 64.0 2.0 22.592001 58.591999 21.088000 130 256.0 64.0 4.0 21.728000 58.623999 22.399999 131 256.0 64.0 8.0 23.744000 66.367999 27.680000 132 256.0 128.0 1.0 23.391999 65.600000 31.296000 133 256.0 128.0 2.0 23.135999 63.023999 32.000002 134 256.0 128.0 4.0 24.704000 68.031996 32.960001 135 256.0 128.0 8.0 23.744000 61.919998 36.031999 136 256.0 256.0 1.0 28.224001 68.768002 65.920003 137 256.0 256.0 2.0 26.591999 65.952003 66.192001 138 256.0 256.0 4.0 27.807999 52.064002 66.479996 139 256.0 256.0 8.0 28.896000 58.784001 68.191998 140 512.0 8.0 1.0 19.231999 69.408000 16.960001 141 512.0 8.0 2.0 19.455999 58.432002 20.800000 142 512.0 8.0 4.0 20.512000 69.215998 28.320000 143 512.0 8.0 8.0 21.056000 114.047997 42.080000 144 512.0 32.0 1.0 21.919999 59.119999 19.200001 145 512.0 32.0 2.0 20.992000 59.071999 23.264000 146 512.0 32.0 4.0 21.536000 58.079999 
30.400001 147 512.0 32.0 8.0 23.072001 59.840001 44.512000 148 512.0 64.0 1.0 22.464000 58.304001 21.136001 149 512.0 64.0 2.0 21.663999 32.256000 22.528000 150 512.0 64.0 4.0 23.424000 50.687999 28.112000 151 512.0 64.0 8.0 23.808001 63.519999 41.184001 152 512.0 128.0 1.0 23.040000 59.840001 31.968001 153 512.0 128.0 2.0 23.296000 65.888003 32.960001 154 512.0 128.0 4.0 23.647999 69.327995 35.872001 155 512.0 128.0 8.0 24.192000 69.343999 43.040000 156 512.0 256.0 1.0 26.848000 69.087997 66.111997 157 512.0 256.0 2.0 26.912000 70.319995 66.399999 158 512.0 256.0 4.0 28.928000 56.832001 68.159997 159 512.0 256.0 8.0 28.928000 65.952003 71.392000 160 1024.0 8.0 1.0 20.223999 73.151998 20.896001 161 1024.0 8.0 2.0 21.632001 69.087997 28.352000 162 1024.0 8.0 4.0 22.048000 113.696001 42.032000 163 1024.0 8.0 8.0 25.504000 206.880003 71.039997 164 1024.0 32.0 1.0 21.952000 63.648000 23.264000 165 1024.0 32.0 2.0 21.504000 66.512004 30.400001 166 1024.0 32.0 4.0 22.080000 63.167997 44.480000 167 1024.0 32.0 8.0 24.752000 166.495994 78.368001 168 1024.0 64.0 1.0 21.183999 69.728002 22.464000 169 1024.0 64.0 2.0 23.488000 57.376001 27.664000 170 1024.0 64.0 4.0 22.879999 57.744000 40.959999 171 1024.0 64.0 8.0 26.303999 70.464000 65.183997 172 1024.0 128.0 1.0 23.776000 68.095997 32.864001 173 1024.0 128.0 2.0 23.680000 62.112000 35.824001 174 1024.0 128.0 4.0 24.544001 57.599999 43.040000 175 1024.0 128.0 8.0 27.616000 67.071997 55.135999 176 1024.0 256.0 1.0 26.912000 55.408001 66.463999 177 1024.0 256.0 2.0 27.456000 68.448000 68.191998 178 1024.0 256.0 4.0 27.968001 60.768001 71.295999 179 1024.0 256.0 8.0 30.304000 62.080000 81.887998 180 2048.0 8.0 1.0 20.256000 69.151998 28.320000 181 2048.0 8.0 2.0 20.927999 113.920003 42.048000 182 2048.0 8.0 4.0 25.504000 207.039997 71.071997 183 2048.0 8.0 8.0 32.224000 378.208011 127.519995 184 2048.0 32.0 1.0 22.464000 63.519999 30.432001 185 2048.0 32.0 2.0 23.040000 62.272001 44.447999 186 2048.0 32.0 4.0 23.488000 72.512001 79.392001 187 2048.0 32.0 8.0 29.152000 119.616002 146.559998 188 2048.0 64.0 1.0 23.712000 59.424002 27.295999 189 2048.0 64.0 2.0 22.976000 64.640000 40.927999 190 2048.0 64.0 4.0 25.184000 57.472002 65.183997 191 2048.0 64.0 8.0 31.936001 75.903997 110.271998 192 2048.0 128.0 1.0 23.744000 65.952003 35.744000 193 2048.0 128.0 2.0 25.472000 53.984001 43.072000 194 2048.0 128.0 4.0 26.368000 58.848001 55.135999 195 2048.0 128.0 8.0 32.191999 62.080000 79.328001 196 2048.0 256.0 1.0 28.896000 48.416000 68.223998 197 2048.0 256.0 2.0 27.968001 66.111997 71.359999 198 2048.0 256.0 4.0 31.199999 59.872001 81.791997 199 2048.0 256.0 8.0 37.216000 66.111997 98.272003 200 4096.0 8.0 1.0 20.927999 113.920003 42.048000 201 4096.0 8.0 2.0 25.504000 206.624001 71.039997 202 4096.0 8.0 4.0 31.136001 378.224015 127.424002 203 4096.0 8.0 8.0 43.839999 729.439974 260.639995 204 4096.0 32.0 1.0 21.952000 60.112000 44.544000 205 4096.0 32.0 2.0 23.520000 72.255999 78.847997 206 4096.0 32.0 4.0 29.152000 119.616002 147.872001 207 4096.0 32.0 8.0 46.144001 211.807996 294.223994 208 4096.0 64.0 1.0 22.848001 59.551999 40.991999 209 4096.0 64.0 2.0 25.376000 58.688000 65.279998 210 4096.0 64.0 4.0 30.975999 75.935997 110.207997 211 4096.0 64.0 8.0 49.504001 122.432001 200.703993 212 4096.0 128.0 1.0 24.544001 71.392000 43.040000 213 4096.0 128.0 2.0 27.488001 48.799999 55.232000 214 4096.0 128.0 4.0 32.224000 60.192000 79.328001 215 4096.0 128.0 8.0 49.311999 83.839998 129.695997 216 4096.0 256.0 1.0 28.063999 60.192000 71.263999 217 4096.0 256.0 
2.0 30.272000 67.744002 81.791997 218 4096.0 256.0 4.0 36.928002 68.432003 98.240003 219 4096.0 256.0 8.0 49.472000 78.560002 132.175997 220 8192.0 8.0 1.0 26.880000 206.944004 71.039997 221 8192.0 8.0 2.0 32.095999 378.127992 127.568007 222 8192.0 8.0 4.0 43.968000 730.080009 260.320008 223 8192.0 8.0 8.0 69.408000 1435.008049 507.327974 224 8192.0 32.0 1.0 23.520000 72.832003 78.656003 225 8192.0 32.0 2.0 29.216001 119.935997 147.072002 226 8192.0 32.0 4.0 47.072001 212.495998 291.584015 227 8192.0 32.0 8.0 71.071997 386.880010 612.031996 228 8192.0 64.0 1.0 26.591999 53.760000 65.311998 229 8192.0 64.0 2.0 30.944001 76.063998 110.111997 230 8192.0 64.0 4.0 49.791999 122.528002 200.800002 231 8192.0 64.0 8.0 77.632003 215.792000 382.863998 232 8192.0 128.0 1.0 26.400000 60.031999 55.167999 233 8192.0 128.0 2.0 32.448001 60.640000 79.392001 234 8192.0 128.0 4.0 49.472000 83.871998 129.728004 235 8192.0 128.0 8.0 73.792003 131.487995 228.960007 236 8192.0 256.0 1.0 30.368000 62.928006 81.823997 237 8192.0 256.0 2.0 37.248001 65.792002 98.112002 238 8192.0 256.0 4.0 50.592002 78.688003 132.128000 239 8192.0 256.0 8.0 75.328000 105.439998 448.480010
pr: