[Feature] Apply Cublas Grouped Gemm kernel #3629
Merged
Conversation
yizhang2077
reviewed
Feb 17, 2025
yizhang2077
reviewed
Feb 17, 2025
Since PyTorch 2.5.1 only supports CUDA 12.4 in the official docs, and we cannot easily change the PyTorch version, we need to update the docs to guide users to reinstall PyTorch if they want to use grouped GEMM to accelerate their models.
LGTM cc @zhyncs
yizhang2077
reviewed
Feb 18, 2025
yizhang2077
approved these changes
Feb 18, 2025
amazing work!
Motivation
#3323
The grouped GEMM kernel added in cuBLAS 12.5 is useful: it can be applied to the MoE EP layer and the LoRA layer for acceleration.
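For context, a grouped GEMM performs many independent matrix multiplies, each with potentially different shapes, in a single kernel launch; this is exactly the pattern MoE expert batching produces. A minimal NumPy reference (not the sgl-kernel or cuBLAS implementation; shapes are made up for illustration):

```python
import numpy as np

def grouped_gemm_reference(a_list, b_list):
    """Reference grouped GEMM: one independent C_i = A_i @ B_i per group.
    A grouped kernel fuses all of these into one launch instead of a loop."""
    return [a @ b for a, b in zip(a_list, b_list)]

# Example: 3 groups with different M per group, as happens when tokens are
# routed unevenly across MoE experts. N and K are shared across groups here.
rng = np.random.default_rng(0)
a_list = [rng.standard_normal((m, 4)) for m in (2, 5, 3)]  # per-group activations
b_list = [rng.standard_normal((4, 6)) for _ in range(3)]   # per-group weights
c_list = grouped_gemm_reference(a_list, b_list)
```

A CUDA grouped GEMM returns the same per-group results, but amortizes launch overhead across all groups.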
Modifications
Adds cublas_grouped_gemm to the sgl-kernel library, and provides an accuracy test and a benchmark script.

Environment: Torch 2.5.1, CUDA 12.5, cuBLAS 12.5.3.2, sglang 0.4.3
Since sglang doesn't support torch 2.6 yet, to build the environment: check the CUDA version with nvcc -V, then run

pip install nvidia-cublas-cu12==12.5.3.2

so that cuBLAS is upgraded.

Accuracy Test
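The shape of such an accuracy test is to compare the kernel's per-group outputs against an fp32 matmul reference within a tolerance. The sketch below is an assumption about the test's structure, not the actual sgl-kernel test; the kernel output is stood in for by a NumPy fp32 matmul so the comparison logic itself can run anywhere.

```python
import numpy as np

def check_grouped_gemm(a_list, b_list, out_list, rtol=1e-3, atol=1e-3):
    """Compare per-group kernel outputs against an fp32 matmul reference."""
    for a, b, out in zip(a_list, b_list, out_list):
        ref = a.astype(np.float32) @ b.astype(np.float32)
        if not np.allclose(out, ref, rtol=rtol, atol=atol):
            return False
    return True

rng = np.random.default_rng(0)
a_list = [rng.standard_normal((4, 8)).astype(np.float16) for _ in range(3)]
b_list = [rng.standard_normal((8, 5)).astype(np.float16) for _ in range(3)]
# Stand-in for the cublas_grouped_gemm output (illustration only):
out_list = [a.astype(np.float32) @ b.astype(np.float32)
            for a, b in zip(a_list, b_list)]
ok = check_grouped_gemm(a_list, b_list, out_list)
```

The real test would obtain out_list from the sgl-kernel op on GPU tensors and compare against torch.matmul in the same way.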
Kernel Benchmark
Deepseek V2 setting
On Deepseek V2 setting with TP Size = 8 (Group Size=20), N = 3072, K = 5120:
python3 sgl-kernel/benchmark/bench_cublas_grouped_gemm.py --models DeepSeek-V2 --tp-size 8
Result in GB per second:

Deepseek V2-Lite setting
On Deepseek V2-Lite setting with TP Size = 2 (Group Size=32), N = 2816, K = 2048:
python3 sgl-kernel/benchmark/bench_cublas_grouped_gemm.py --models DeepSeek-V2-Lite --tp-size 2
Result in GB per second:

Checklist