whchung
Contributor

@whchung commented Feb 13, 2025

Modifications

Use tl.range() in the block GEMM kernels, with num_stages set by the host, to hint Triton to produce better software pipelining.
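
To make the intent of the num_stages hint concrete, here is a minimal pure-Python model of what a software-pipelined K-loop does. It is a sketch under stated assumptions, not Triton's actual scheduling and not the code in this PR: the function name pipelined_k_loop and the "load"/"mma" event labels are hypothetical, and the model assumes a pipeline of depth num_stages issues tile loads num_stages - 1 iterations ahead of the multiply-accumulate that consumes them.

```python
from collections import deque


def pipelined_k_loop(num_k_tiles, num_stages):
    """Return the event order of a K-loop pipelined num_stages deep.

    Hypothetical model: "load k" stands for fetching the k-th A/B tiles,
    "mma k" for the multiply-accumulate that consumes them.
    """
    events = []
    in_flight = deque()                # tiles loaded but not yet consumed
    prefetch = max(num_stages - 1, 0)  # loads issued ahead of compute
    next_load = 0

    # Prologue: fill the pipeline before any compute starts.
    while next_load < min(prefetch, num_k_tiles):
        events.append(f"load {next_load}")
        in_flight.append(next_load)
        next_load += 1

    # Steady state and epilogue: issue the next load (if any tiles remain),
    # then consume the oldest tile that is already in flight.
    for _ in range(num_k_tiles):
        if next_load < num_k_tiles:
            events.append(f"load {next_load}")
            in_flight.append(next_load)
            next_load += 1
        events.append(f"mma {in_flight.popleft()}")
    return events


if __name__ == "__main__":
    for stages in (1, 2, 3):
        print(stages, pipelined_k_loop(4, stages))
```

With num_stages=1 the model degenerates to the unpipelined order (load, then compute, per tile). Larger values keep more tiles in flight, which is what lets load latency overlap with compute, at the cost of more buffering.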

Checklist

Collaborator

@HaiShaw left a comment

LG

@whchung marked this pull request as ready for review February 13, 2025 12:56
@whchung force-pushed the whchung/_w8a8_block_fp8_matmul_num_stages2 branch 3 times, most recently from 461a6b8 to 447b2b2 February 15, 2025 17:11
@whchung changed the title from "Use tl.range() in block GEMM kernels with num_stages set by host." to "[ROCm] Use tl.range() in block GEMM kernels with num_stages set by host." Feb 15, 2025
@whchung force-pushed the whchung/_w8a8_block_fp8_matmul_num_stages2 branch from 447b2b2 to 1656136 February 15, 2025 18:22
@whchung force-pushed the whchung/_w8a8_block_fp8_matmul_num_stages2 branch from 1656136 to 4b3f174 February 15, 2025 18:22
@HaiShaw merged commit 03caefe into sgl-project:main Feb 16, 2025
14 of 19 checks passed
zhyncs added a commit that referenced this pull request Feb 17, 2025
@yiakwy-xpu-ml-framework-team
Contributor

@whchung we recently had a discussion on GEMM performance tuning. Performance data from NVIDIA indicates that, for GEMM, WASP does not perform as well as a cooperative kernel launch across many different tile shapes.
