[sgl-kernel] Add cuda kernel for moe_ep_silu_and_mul #6919
Conversation
Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This PR, marked as Work In Progress (WIP), introduces a new CUDA kernel for the SiLU activation followed by elementwise multiplication, intended for use within Mixture-of-Experts (MoE) layers. The goal is likely to provide a highly optimized implementation of this common operation on the GPU.
The changes involve adding the core CUDA kernel implementation, registering it with the PyTorch extension mechanism, creating C++ and Python wrappers to make it callable from Python, and adding a unit test to verify its correctness by comparing its output against an existing Triton kernel implementation.
Highlights
- New CUDA Kernel: Adds a new CUDA kernel (`ep_moe_act_and_mul_cuda_kernel`) to perform the SiLU activation and multiplication operation efficiently on the GPU for MoE layers.
- PyTorch Integration: Registers the new CUDA kernel with the PyTorch extension (`sgl_kernel`) and provides C++ and Python wrappers (`ep_moe_silu_and_mul`) to expose it to the Python API.
- Unit Testing: Includes a new unit test (`test_ep_moe_silu_and_mul_vs_triton`) that compares the output of the new CUDA kernel against an existing Triton implementation for various configurations (token counts, hidden sizes, data types) to ensure correctness.
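For readers unfamiliar with the operation, the SiLU-and-mul pattern the kernel implements can be sketched as a plain-Python scalar reference (a hypothetical illustration of the math, not the PR's actual kernel; the `[gate | up]` layout is an assumption based on common MoE implementations):

```python
import math

def silu(x: float) -> float:
    # SiLU (a.k.a. swish): x * sigmoid(x), written as x / (1 + e^-x)
    return x / (1.0 + math.exp(-x))

def silu_and_mul(row, hidden_size):
    # Assumed layout: the input row holds [gate | up] halves of length
    # hidden_size / 2 each; the output is silu(gate) * up, elementwise.
    half = hidden_size // 2
    gate, up = row[:half], row[half:]
    return [silu(g) * u for g, u in zip(gate, up)]

print(silu_and_mul([1.0, -1.0, 2.0, 0.5], 4))
```

The CUDA kernel performs this same reduction from `hidden_size` inputs to `hidden_size / 2` outputs per token, but vectorized across threads.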
Changelog
- `sgl-kernel/csrc/common_extension.cc`
  - Registers the new `ep_moe_silu_and_mul` kernel with the PyTorch `sgl_kernel` library fragment.
- `sgl-kernel/csrc/moe/ep_moe_silu_and_mul.cu`
  - Adds a new CUDA source file containing the `ep_moe_act_and_mul_cuda_kernel`, which implements the SiLU and multiplication logic for MoE.
  - Includes a C++ wrapper function `ep_moe_silu_and_mul` to launch the CUDA kernel.
- `sgl-kernel/include/sgl_kernel_ops.h`
  - Adds the C++ function declaration for `ep_moe_silu_and_mul`.
- `sgl-kernel/include/utils.h`
  - Adds a device function `silu` for the SiLU activation.
- `sgl-kernel/python/sgl_kernel/__init__.py`
  - Imports the new `ep_moe_silu_and_mul` function into the Python package's public API.
- `sgl-kernel/python/sgl_kernel/moe.py`
  - Adds a Python wrapper function `ep_moe_silu_and_mul` that calls the registered C++ kernel via `torch.ops.sgl_kernel`.
- `sgl-kernel/tests/test_ep_moe_silu_and_mul_kernel.py`
  - Adds a new test file with a Pytest test, `test_ep_moe_silu_and_mul_vs_triton`.
  - This test compares the output of the new CUDA kernel against a Triton kernel implementation for correctness across different parameters and data types.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., commenting on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request introduces a new CUDA kernel, `ep_moe_silu_and_mul`, for Mixture-of-Experts, along with its PyTorch bindings and Python tests. Adding a custom CUDA kernel for this operation is a good step for performance optimization.
However, the CUDA kernel implementation (`ep_moe_silu_and_mul.cu`) has several critical issues that will prevent compilation or lead to incorrect behavior. These include incorrect variable declarations, undefined variables, missing kernel arguments, and incorrect loop bounds. The host wrapper function also has issues with its kernel launch configuration and argument passing.
The Python tests comparing against a Triton kernel are a good step towards ensuring correctness once the CUDA kernel issues are resolved.
Given the WIP (Work In Progress) status in the PR title and the nature of the identified issues, significant revisions are needed before this can be merged.
Summary of Findings
- CUDA Kernel Compilation Errors: The CUDA kernel `ep_moe_act_and_mul_cuda_kernel` and its host wrapper `ep_moe_silu_and_mul` have several issues that will prevent compilation:
  - `const float scale` is modified (critical).
  - Undefined `total_blocks` in the grid configuration (critical).
  - Missing `hidden_size` argument in the kernel launch (critical).
  - Incorrect type cast for `scales.data_ptr()` in the kernel launch (critical).
  - Undefined `tid` in the kernel loop (critical).
  - Undefined `vec_t` type in the kernel (critical).
- CUDA Kernel Logic Error: The `vec_elements` calculation for loop bounds in the CUDA kernel is based on `hidden_size` but should be based on `half_hidden_size` (high).
- Unused Variable in CUDA Kernel: An unused variable `vec` is declared in the CUDA kernel (medium).
- Non-English Comment in Test Code: A comment in `sgl-kernel/tests/test_ep_moe_silu_and_mul_kernel.py` (line 68) is in Chinese. While not a functional issue, English comments are generally preferred for broader accessibility in open-source projects. (Severity: low; not commented directly due to settings.)
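The loop-bound finding above can be illustrated with a small sketch (hypothetical names mirroring the review, not the PR's actual code): since each token produces only `hidden_size / 2` outputs, a vectorized loop whose trip count is derived from the full `hidden_size` would run past the output buffer.

```python
def vec_loop_bounds(hidden_size: int, vec_size: int):
    # Each row reads hidden_size inputs ([gate | up]) but writes only
    # half_hidden_size outputs, so the vector-loop bound must come from
    # the half width, not the full width.
    half_hidden_size = hidden_size // 2
    wrong = hidden_size // vec_size        # iterates past the output buffer
    right = half_hidden_size // vec_size   # matches the actual output length
    return wrong, right

print(vec_loop_bounds(4096, 8))  # → (512, 256)
```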
Merge Readiness
This pull request is currently a Work In Progress (WIP) and has several critical issues in the CUDA kernel implementation that need to be addressed. These issues will prevent compilation and/or lead to incorrect functionality. Therefore, the PR is not ready for merging.
I recommend addressing all the critical and high severity issues identified in the review comments. Once these are resolved, further review, especially focusing on performance and edge cases, would be beneficial. As a reviewer, I am not authorized to approve pull requests; please ensure other maintainers review and approve the changes before merging.
Force-pushed 5694052 to 084e41f
Force-pushed d071ba5 to 32b09cc
Benchmark result: CUDA gains a 10% improvement over Triton.
Force-pushed 32b09cc to ce154f7
Force-pushed 0048d0f to 6b88282
def benchmark(batch_size, provider):
    dtype = torch.bfloat16
    device = torch.device("cuda")
    hidden_size, topk, start_expert_id, end_expert_id, block_size = 4096, 8, 0, 255, 512
Can these variables also take different value combinations in the configs above?
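One way to sweep those hard-coded variables (a hypothetical sketch, not the benchmark's actual config) is to cross the extra parameters with `itertools.product` and feed the resulting tuples into the benchmark configs:

```python
from itertools import product

# Hypothetical sweep of the currently hard-coded benchmark parameters.
hidden_sizes = [1024, 2048, 4096, 8192]
topks = [4, 8]
block_sizes = [128, 256, 512]

configs = list(product(hidden_sizes, topks, block_sizes))
print(len(configs), "configurations")  # 4 * 2 * 3 = 24
```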
$python ./sgl-kernel/benchmark/bench_moe_silu_and_mul.py
Failed to import deepgemm, disable _ENABLE_JIT_DEEPGEMM.
Failed to import ceil_div from deep_gemm.
ep-moe-silu-and-mul-performance:
batch_size hidden_size block_size CUDA Kernel Triton Kernel
0 64.0 1024.0 128.0 6.656000 6.528000
1 64.0 1024.0 256.0 6.752000 6.784000
2 64.0 1024.0 512.0 6.784000 6.528000
3 64.0 2048.0 128.0 6.912000 7.552000
4 64.0 2048.0 256.0 6.848000 7.296000
5 64.0 2048.0 512.0 6.592000 7.296000
6 64.0 4096.0 128.0 7.040000 8.960000
7 64.0 4096.0 256.0 7.008000 8.672000
8 64.0 4096.0 512.0 6.752000 8.672000
9 64.0 8192.0 128.0 7.232000 11.360000
10 64.0 8192.0 256.0 6.976000 11.616000
11 64.0 8192.0 512.0 7.232000 11.328000
12 128.0 1024.0 128.0 6.592000 6.624000
13 128.0 1024.0 256.0 6.592000 6.656000
14 128.0 1024.0 512.0 6.848000 6.880000
15 128.0 2048.0 128.0 7.008000 7.744000
16 128.0 2048.0 256.0 7.008000 7.776000
17 128.0 2048.0 512.0 7.008000 7.488000
18 128.0 4096.0 128.0 7.232000 9.248000
19 128.0 4096.0 256.0 6.976000 9.248000
20 128.0 4096.0 512.0 7.200000 9.024000
21 128.0 8192.0 128.0 7.888000 12.064000
22 128.0 8192.0 256.0 7.616000 12.288000
23 128.0 8192.0 512.0 7.904000 12.064000
24 256.0 1024.0 128.0 6.816000 6.848000
25 256.0 1024.0 256.0 7.040000 6.848000
26 256.0 1024.0 512.0 7.072000 7.104000
27 256.0 2048.0 128.0 7.008000 7.872000
28 256.0 2048.0 256.0 6.976000 7.872000
29 256.0 2048.0 512.0 7.232000 8.128000
30 256.0 4096.0 128.0 7.680000 9.664000
31 256.0 4096.0 256.0 7.648000 9.696000
32 256.0 4096.0 512.0 7.904000 9.888000
33 256.0 8192.0 128.0 8.736000 13.408000
34 256.0 8192.0 256.0 8.960000 13.408000
35 256.0 8192.0 512.0 8.928000 13.376000
36 512.0 1024.0 128.0 8.512000 7.136000
37 512.0 1024.0 256.0 8.544000 7.424000
38 512.0 1024.0 512.0 8.544000 7.168000
39 512.0 2048.0 128.0 8.896000 8.704000
40 512.0 2048.0 256.0 8.896000 8.704000
41 512.0 2048.0 512.0 8.928000 8.448000
42 512.0 4096.0 128.0 9.696000 10.560000
43 512.0 4096.0 256.0 9.728000 10.592000
44 512.0 4096.0 512.0 9.472000 10.592000
45 512.0 8192.0 128.0 11.360000 14.592000
46 512.0 8192.0 256.0 11.360000 14.784000
47 512.0 8192.0 512.0 11.360000 14.560000
48 640.0 1024.0 128.0 9.024000 7.424000
49 640.0 1024.0 256.0 8.800000 7.648000
50 640.0 1024.0 512.0 9.024000 7.392000
51 640.0 2048.0 128.0 9.408000 8.736000
52 640.0 2048.0 256.0 9.472000 8.736000
53 640.0 2048.0 512.0 9.184000 8.960000
54 640.0 4096.0 128.0 10.432000 10.880000
55 640.0 4096.0 256.0 10.432000 10.912000
56 640.0 4096.0 512.0 10.144000 11.104000
57 640.0 8192.0 128.0 12.224000 15.424000
58 640.0 8192.0 256.0 12.480000 15.392000
59 640.0 8192.0 512.0 12.448000 15.616000
60 768.0 1024.0 128.0 9.472000 7.872000
61 768.0 1024.0 256.0 9.472000 7.872000
62 768.0 1024.0 512.0 9.728000 7.616000
63 768.0 2048.0 128.0 10.784000 9.408000
64 768.0 2048.0 256.0 10.528000 9.184000
65 768.0 2048.0 512.0 10.496000 9.184000
66 768.0 4096.0 128.0 11.040000 11.584000
67 768.0 4096.0 256.0 11.072000 11.584000
68 768.0 4096.0 512.0 11.072000 11.584000
69 768.0 8192.0 128.0 13.760000 16.192000
70 768.0 8192.0 256.0 13.760000 16.128000
71 768.0 8192.0 512.0 13.760000 16.416000
72 1024.0 1024.0 128.0 10.688000 8.000000
73 1024.0 1024.0 256.0 10.688000 8.000000
74 1024.0 1024.0 512.0 10.944000 8.000000
75 1024.0 2048.0 128.0 11.360000 9.632000
76 1024.0 2048.0 256.0 11.632000 9.824000
77 1024.0 2048.0 512.0 11.392000 9.632000
78 1024.0 4096.0 128.0 12.864000 12.544000
79 1024.0 4096.0 256.0 12.896000 12.768000
80 1024.0 4096.0 512.0 12.896000 12.512000
81 1024.0 8192.0 128.0 15.904000 18.912001
82 1024.0 8192.0 256.0 15.840000 18.912001
83 1024.0 8192.0 512.0 15.807999 18.912001
84 2048.0 1024.0 128.0 14.944000 10.208000
85 2048.0 1024.0 256.0 15.232000 9.920000
86 2048.0 1024.0 512.0 14.976000 10.240000
87 2048.0 2048.0 128.0 16.160000 13.088000
88 2048.0 2048.0 256.0 15.936000 13.120000
89 2048.0 2048.0 512.0 16.192000 13.312000
90 2048.0 4096.0 128.0 18.495999 18.688001
91 2048.0 4096.0 256.0 18.495999 18.464001
92 2048.0 4096.0 512.0 18.495999 18.464001
93 2048.0 8192.0 128.0 25.792001 30.592000
94 2048.0 8192.0 256.0 25.823999 30.880000
95 2048.0 8192.0 512.0 25.536001 30.880000
96 4096.0 1024.0 128.0 22.879999 13.632000
97 4096.0 1024.0 256.0 22.848001 13.648000
98 4096.0 1024.0 512.0 22.848001 13.440000
99 4096.0 2048.0 128.0 24.992000 18.816000
100 4096.0 2048.0 256.0 25.216000 18.592000
101 4096.0 2048.0 512.0 24.992000 18.816000
102 4096.0 4096.0 128.0 30.751999 30.015999
103 4096.0 4096.0 256.0 30.432001 30.080000
104 4096.0 4096.0 512.0 30.751999 30.272000
105 4096.0 8192.0 128.0 43.040000 50.783999
106 4096.0 8192.0 256.0 43.008000 50.816000
107 4096.0 8192.0 512.0 43.040000 50.816000
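The relative speedup implied by the table can be computed directly; the sketch below uses a few values copied from the output above (units as printed) and shows that the CUDA kernel wins at large hidden sizes while Triton is faster at small hidden sizes with large batches:

```python
# Speedup of the CUDA kernel over Triton, from a few rows of the table above.
rows = [
    # (batch_size, hidden_size, block_size, cuda, triton) -- values as printed
    (64,   8192, 128,  7.232000, 11.360000),
    (1024, 8192, 128, 15.904000, 18.912001),
    (4096, 1024, 128, 22.879999, 13.632000),  # small hidden size: Triton wins here
]
for bs, h, blk, cuda, triton in rows:
    print(f"bs={bs} hidden={h}: CUDA is {triton / cuda:.2f}x vs Triton")
```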
out_vec.store(dst_ptr + idx * vec_size);
}

#if (__CUDACC_VER_MAJOR__ >= 12 && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
Is this necessary?
Removed.
Force-pushed 551289c to 7682138
Force-pushed 7682138 to 88c834e
LGTM.
Force-pushed 88c834e to de3a17e
Force-pushed de3a17e to a244149
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Motivation
Rewrite the `moe_ep_silu_and_mul` Triton kernel in CUDA, gaining a 10% performance improvement.
Modifications
Checklist