
Conversation

@BBuf (Collaborator) commented Feb 17, 2025

  • Use a warp-shuffle-style (`__shfl_xor_sync`) reduction.
  • Use flashinfer-style vectorized memory access.
Benchmark on main:


✅ All implementations match
per-token-group-quant-fp8-performance:
    batch_size  seq_len  group_size       Triton   SGL Kernel
0          1.0     64.0       128.0    10.368000     9.344000
1          1.0    128.0       128.0    14.304000    12.032000
2          1.0    256.0       128.0    19.967999    14.208000
3          1.0    512.0       128.0    30.432001    20.000000
4          1.0   1024.0       128.0    52.767999    31.647999
5          1.0   2048.0       128.0    96.928000    55.840001
6          2.0     64.0       128.0    14.336000    12.032000
7          2.0    128.0       128.0    20.064000    14.272000
8          2.0    256.0       128.0    30.495999    20.096000
9          2.0    512.0       128.0    52.623998    31.647999
10         2.0   1024.0       128.0    96.896000    55.808000
11         2.0   2048.0       128.0   184.640005   100.351997
12         4.0     64.0       128.0    20.000000    14.304000
13         4.0    128.0       128.0    30.495999    20.064000
14         4.0    256.0       128.0    52.639998    31.647999
15         4.0    512.0       128.0    96.928000    55.872001
16         4.0   1024.0       128.0   184.527993   100.415997
17         4.0   2048.0       128.0   358.815998   188.863993
18         8.0     64.0       128.0    30.495999    20.096000
19         8.0    128.0       128.0    52.607998    31.599998
20         8.0    256.0       128.0    96.960001    55.776000
21         8.0    512.0       128.0   184.576005   100.383997
22         8.0   1024.0       128.0   358.815998   188.863993
23         8.0   2048.0       128.0   707.199991   365.696013
24        16.0     64.0       128.0    52.623998    31.711999
25        16.0    128.0       128.0    96.960001    55.808000
26        16.0    256.0       128.0   184.512004   100.319996
27        16.0    512.0       128.0   358.880013   188.896000
28        16.0   1024.0       128.0   707.552016   365.599990
29        16.0   2048.0       128.0  1404.448032   718.303978
30        32.0     64.0       128.0    96.928000    55.808000
31        32.0    128.0       128.0   184.512004   100.415997
32        32.0    256.0       128.0   358.848006   188.767999
33        32.0    512.0       128.0   707.455993   365.503997
34        32.0   1024.0       128.0  1404.160023   718.144000
35        32.0   2048.0       128.0  2798.991919  1424.703956
36        64.0     64.0       128.0   184.512004   100.447997
37        64.0    128.0       128.0   358.864009   188.991994
38        64.0    256.0       128.0   707.455993   365.920007
39        64.0    512.0       128.0  1404.255986   718.335986
40        64.0   1024.0       128.0  2798.896074  1424.512029
41        64.0   2048.0       128.0  5587.423801  2836.800098


Benchmark on this PR:

✅ All implementations match
per-token-group-quant-fp8-performance:
    batch_size  seq_len  group_size       Triton   SGL Kernel
0          1.0     64.0       128.0    10.432000     8.928000
1          1.0    128.0       128.0    14.272000    11.488000
2          1.0    256.0       128.0    20.064000    13.696000
3          1.0    512.0       128.0    30.432001    20.032000
4          1.0   1024.0       128.0    52.767999    29.983999
5          1.0   2048.0       128.0    96.992001    53.472001
6          2.0     64.0       128.0    14.272000    11.584000
7          2.0    128.0       128.0    20.064000    13.664000
8          2.0    256.0       128.0    30.495999    19.936001
9          2.0    512.0       128.0    52.639998    29.952001
10         2.0   1024.0       128.0    96.896000    53.472001
11         2.0   2048.0       128.0   184.543997    97.024001
12         4.0     64.0       128.0    20.032000    13.664000
13         4.0    128.0       128.0    30.528000    19.936001
14         4.0    256.0       128.0    52.607998    29.952001
15         4.0    512.0       128.0    96.896000    53.472001
16         4.0   1024.0       128.0   184.592009    96.928000
17         4.0   2048.0       128.0   358.815998   182.047993
18         8.0     64.0       128.0    30.495999    20.000000
19         8.0    128.0       128.0    52.576002    29.983999
20         8.0    256.0       128.0    96.928000    53.440001
21         8.0    512.0       128.0   184.479997    96.944004
22         8.0   1024.0       128.0   358.815998   182.111993
23         8.0   2048.0       128.0   707.264006   352.223992
24        16.0     64.0       128.0    52.607998    29.983999
25        16.0    128.0       128.0    96.896000    53.472001
26        16.0    256.0       128.0   184.496000    96.896000
27        16.0    512.0       128.0   358.815998   182.016000
28        16.0   1024.0       128.0   707.488000   352.288008
29        16.0   2048.0       128.0  1404.224038   692.128003
30        32.0     64.0       128.0    96.864000    53.440001
31        32.0    128.0       128.0   184.448004    96.896000
32        32.0    256.0       128.0   358.783990   182.080001
33        32.0    512.0       128.0   707.423985   352.160007
34        32.0   1024.0       128.0  1404.639959   692.224026
35        32.0   2048.0       128.0  2798.768044  1372.623920
36        64.0     64.0       128.0   184.479997    96.864000
37        64.0    128.0       128.0   358.783990   182.175994
38        64.0    256.0       128.0   707.584023   352.400005
39        64.0    512.0       128.0  1404.575944   692.319989
40        64.0   1024.0       128.0  2799.504042  1372.943997
41        64.0   2048.0       128.0  5588.575840  2734.496117

@yiakwy-xpu-ml-framework-team (Contributor) commented:

@BBuf Great job! I can help you compile the code on AMD GPUs to facilitate merging your algorithm.

@BBuf BBuf changed the title use warp shuffle style reduce use warp shuffle style reduce and flashinfer vectorize Feb 19, 2025
@zhyncs zhyncs merged commit 55a7ec3 into main Feb 19, 2025
2 of 6 checks passed
@zhyncs zhyncs deleted the use__shfl_xor_sync_style_reduce branch February 19, 2025 12:53
3 participants