[ROCm] Use `tl.range()` in block GEMM kernels with `num_stages` set by host. #3535

whchung · 2025-02-13T01:55:22Z

Modifications

Use `tl.range()` in the block GEMM kernels, with `num_stages` set by the host, to hint Triton to produce better software pipelining.
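For readers unfamiliar with the pattern, here is a minimal sketch of a plain tiled GEMM whose K loop uses `tl.range()` with a stage count supplied by the host. It is not the diff from this PR and does not reproduce the kernels it touches. The names `matmul_kernel`, `NUM_STAGES`, and `grid_size` are hypothetical, and the sketch assumes a Triton version in which `tl.range()` accepts a `tl.constexpr` for `num_stages`.

```python
try:
    import triton
    import triton.language as tl
except ImportError:  # lets the host-side helper below run without Triton
    triton = None


def grid_size(M, N, block_m, block_n):
    """Hypothetical helper: 1D launch grid with one program per output tile."""
    tiles_m = (M + block_m - 1) // block_m
    tiles_n = (N + block_n - 1) // block_n
    return (tiles_m * tiles_n,)


if triton is not None:

    @triton.jit
    def matmul_kernel(
        a_ptr, b_ptr, c_ptr,
        M, N, K,
        stride_am, stride_ak,
        stride_bk, stride_bn,
        stride_cm, stride_cn,
        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
        NUM_STAGES: tl.constexpr,  # hypothetical name; value chosen by the host
    ):
        # Map the 1D program id to an output tile (pid_m, pid_n).
        pid = tl.program_id(axis=0)
        num_pid_n = tl.cdiv(N, BLOCK_N)
        pid_m = pid // num_pid_n
        pid_n = pid % num_pid_n

        offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        offs_k = tl.arange(0, BLOCK_K)
        a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
        b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        # tl.range() attaches the pipelining hint to this loop only.
        for k in tl.range(0, tl.cdiv(K, BLOCK_K), num_stages=NUM_STAGES):
            k_mask = offs_k < K - k * BLOCK_K
            a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & k_mask[None, :], other=0.0)
            b = tl.load(b_ptrs, mask=k_mask[:, None] & (offs_n[None, :] < N), other=0.0)
            acc += tl.dot(a, b)
            a_ptrs += BLOCK_K * stride_ak
            b_ptrs += BLOCK_K * stride_bk

        c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
        tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))
```

On the host, a launch would look roughly like `matmul_kernel[grid_size(M, N, 64, 64)](..., BLOCK_M=64, BLOCK_N=64, BLOCK_K=32, NUM_STAGES=n)`, where `n` is whatever the host-side config selects. The loop-level `num_stages` is a hint on this specific loop, separate from the kernel-level `num_stages` launch option.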

Checklist

Format your code according to the Code Formatting with Pre-Commit.

HaiShaw

LG


yiakwy-xpu-ml-framework-team · 2025-02-17T10:41:44Z

@whchung we had a recent discussion on GEMM performance tuning. Performance data from NVIDIA suggests that, for GEMM, WASP is not as good as a cooperative kernel launch across many different tile shapes.

HaiShaw approved these changes Feb 13, 2025


whchung marked this pull request as ready for review February 13, 2025 12:56

whchung requested review from merrymercy, Ying1123, zhyncs and ispobock as code owners February 13, 2025 12:56

whchung force-pushed the whchung/_w8a8_block_fp8_matmul_num_stages2 branch 3 times, most recently from 461a6b8 to 447b2b2 February 15, 2025 17:11

whchung changed the title ~~Use tl.range() in block GEMM kernels with num_stages set by host.~~ [ROCm] Use tl.range() in block GEMM kernels with num_stages set by host. Feb 15, 2025

whchung force-pushed the whchung/_w8a8_block_fp8_matmul_num_stages2 branch from 447b2b2 to 1656136 February 15, 2025 18:22

whchung added 2 commits February 15, 2025 12:22

Use tl.range() in block GEMM kernels with num_stages set by host.

4684b42

Keep non-AMD path be in the original state.

4b3f174

whchung force-pushed the whchung/_w8a8_block_fp8_matmul_num_stages2 branch from 1656136 to 4b3f174 February 15, 2025 18:22

HaiShaw added 2 commits February 15, 2025 23:55

Merge branch 'main' into whchung/_w8a8_block_fp8_matmul_num_stages2

63108dd

Merge branch 'main' into whchung/_w8a8_block_fp8_matmul_num_stages2

63ffc34

HaiShaw merged commit 03caefe into sgl-project:main Feb 16, 2025
14 of 19 checks passed

zhyncs added a commit that referenced this pull request Feb 17, 2025

Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stage…

f31ef1a

…s` set by host. (#3535)" This reverts commit 03caefe.

zhyncs mentioned this pull request Feb 17, 2025

Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stage… #3632

Merged

6 tasks
