Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. #13932

bnellnm · 2025-02-26T23:10:53Z

Add the option to use DeepGemm's grouped gemm kernel for fused_moe operations.

Benchmark setup

python3 benchmarks/kernels/benchmark_moe.py --dtype=fp8_w8a8 -tp ${TP} --model deepseek-ai/DeepSeek-V3 --trust-remote-code --use-deep-gemm
Note: deepseek-ai/DeepSeek-R1 has similar problem sizes
E=256
N=4096 @ TP=1, 2048 @ TP=2, 1024 @ TP=4, 512 @ TP=8
K=7168
topk=8
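To make the shape list above concrete, here is a small sketch that derives the per-rank N for each TP value and the average number of routed rows per expert. The helper names are hypothetical, and the uniform-routing average is an assumption for illustration only; real routing is not uniform.

```python
# Problem sizes from the benchmark setup above.
E = 256        # number of experts
N_FULL = 4096  # N at TP=1
K = 7168
TOPK = 8


def per_rank_n(tp: int) -> int:
    """N seen by one rank when N is split evenly across `tp` ranks."""
    if N_FULL % tp != 0:
        raise ValueError(f"N={N_FULL} is not divisible by TP={tp}")
    return N_FULL // tp


def avg_rows_per_expert(batch: int) -> float:
    """Average routed rows per expert, ASSUMING perfectly uniform routing."""
    return batch * TOPK / E


# (E, N, K) per tensor-parallel degree.
shapes = {tp: (E, per_rank_n(tp), K) for tp in (1, 2, 4, 8)}

if __name__ == "__main__":
    for tp, shape in shapes.items():
        print(f"TP={tp}: (E, N, K) = {shape}")
    for batch in (128, 4096):
        print(f"batch={batch}: {avg_rows_per_expert(batch)} rows/expert on average")
```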

H100 results

Batch	DeepGemm TP=1	Triton TP=1	DeepGemm TP=2	Triton TP=2	DeepGemm TP=4	Triton TP=4	DeepGemm TP=8	Triton TP=8
128	6355.22 us	12480.49 us	3745.24 us	6025.39 us	2529.50 us	3121.16 us	1936.17 us	1732.33 us
256	6599.51 us	12782.98 us	3917.15 us	6164.44 us	2669.84 us	3236.88 us	2051.67 us	1728.73 us
512	7017.80 us	13017.54 us	4270.21 us	6631.64 us	3000.99 us	3433.29 us	2331.06 us	1810.36 us
1024	7850.04 us	13322.31 us	4874.43 us	6800.41 us	3575.18 us	3546.83 us	2886.08 us	1892.75 us
1536	8691.74 us	13777.98 us	5559.64 us	7037.34 us	4111.26 us	3677.90 us	3423.49 us	1988.70 us
2048	9545.87 us	19839.38 us	6217.40 us	10111.40 us	4699.03 us	5230.76 us	3974.65 us	2818.65 us
3072	11252.75 us	27189.87 us	7609.45 us	13845.57 us	5916.31 us	7172.64 us	5093.12 us	3858.15 us
4096	12843.01 us	33834.67 us	8669.07 us	17239.74 us	7167.47 us	8927.87 us	6225.71 us	4789.98 us
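For readers who want ratios rather than raw times, the sketch below computes Triton time divided by DeepGemm time for two rows of the H100 table (batch 128 and 4096). The numbers are copied from the table; the variable and function names are made up for this example. A ratio above 1 means DeepGemm was faster.

```python
# (DeepGemm us, Triton us) per TP, copied from the H100 table above.
H100_US = {
    128: {
        1: (6355.22, 12480.49),
        2: (3745.24, 6025.39),
        4: (2529.50, 3121.16),
        8: (1936.17, 1732.33),
    },
    4096: {
        1: (12843.01, 33834.67),
        2: (8669.07, 17239.74),
        4: (7167.47, 8927.87),
        8: (6225.71, 4789.98),
    },
}


def speedup(deepgemm_us: float, triton_us: float) -> float:
    """Triton time over DeepGemm time; > 1 means DeepGemm is faster."""
    return triton_us / deepgemm_us


speedups = {
    (batch, tp): speedup(*pair)
    for batch, row in H100_US.items()
    for tp, pair in row.items()
}

if __name__ == "__main__":
    for (batch, tp), ratio in sorted(speedups.items()):
        print(f"batch={batch:5d} TP={tp}: {ratio:.2f}x")
```

For these two rows the ratio is above 1 at TP=1, 2, and 4 and below 1 at TP=8.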

H200 results

Batch	DeepGemm TP=1	Triton TP=1	DeepGemm TP=2	Triton TP=2	DeepGemm TP=4	Triton TP=4	DeepGemm TP=8	Triton TP=8
128	5206.92 us	10821.89 us	3205.37 us	5596.44 us	2257.24 us	2710.88 us	1824.80 us	1472.32 us
256	5415.36 us	11069.98 us	3373.36 us	5587.06 us	2394.34 us	2785.30 us	1952.46 us	1528.53 us
512	5825.33 us	13446.07 us	3711.86 us	6843.68 us	2732.21 us	3568.38 us	2229.71 us	1876.63 us
1024	6599.35 us	13754.83 us	4362.23 us	7016.18 us	3287.25 us	3681.09 us	2754.67 us	1946.25 us
1536	7387.39 us	14218.68 us	5024.80 us	7249.93 us	3851.06 us	3812.64 us	3308.97 us	2049.49 us
2048	8196.78 us	20468.26 us	5683.06 us	10424.77 us	4432.50 us	5384.37 us	3836.07 us	2898.91 us
3072	9842.72 us	28046.94 us	7005.50 us	14262.27 us	5623.91 us	7397.70 us	4915.28 us	3985.38 us
4096	11375.09 us	34898.63 us	8171.68 us	17769.66 us	6829.79 us	9232.80 us	6017.09 us	4942.76 us

I've run server benchmarks using DeepSeek-R1, but right now the Triton and DeepGemm kernels are at parity because of the TP=8 performance of the DeepGemm kernels. I think this can be improved by using permute/unpermute ops (similar to #14568) and/or by using EP instead of TP for fused_moe ops.
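As a rough picture of what permute/unpermute means for a contiguous grouped GEMM, the sketch below groups routed rows by expert so that each expert's rows sit next to each other, pads each group to an alignment, and then scatters results back to the original order. This is not the code from this PR or from #14568; the function names, the `-1` padding marker, and the `align` value are all assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def permute_by_expert(
    topk_ids: Sequence[Sequence[int]], num_experts: int, align: int = 4
) -> Tuple[List[int], List[int]]:
    """Group routed rows by expert.

    Returns (sorted_rows, expert_of_row). sorted_rows[i] is the flat index
    token * topk + slot stored at row i, or -1 for a padding row.
    expert_of_row[i] is the expert that owns row i.
    Assumes every token has the same number of top-k slots.
    """
    buckets: List[List[int]] = [[] for _ in range(num_experts)]
    for token, experts in enumerate(topk_ids):
        for slot, expert in enumerate(experts):
            buckets[expert].append(token * len(experts) + slot)

    sorted_rows: List[int] = []
    expert_of_row: List[int] = []
    for expert, bucket in enumerate(buckets):
        pad = (-len(bucket)) % align  # rows needed to reach a multiple of align
        sorted_rows.extend(bucket + [-1] * pad)
        expert_of_row.extend([expert] * (len(bucket) + pad))
    return sorted_rows, expert_of_row


def unpermute(
    sorted_rows: Sequence[int], values: Sequence[str], num_rows: int
) -> List[Optional[str]]:
    """Scatter per-row results back to flat (token, slot) order, skipping padding."""
    out: List[Optional[str]] = [None] * num_rows
    for row, value in zip(sorted_rows, values):
        if row >= 0:
            out[row] = value
    return out


if __name__ == "__main__":
    ids = [[0, 2], [2, 1], [0, 1]]  # 3 tokens, topk=2, 3 experts
    rows, owners = permute_by_expert(ids, num_experts=3, align=4)
    print(rows)
    print(owners)
```

In a real kernel path the gather and scatter would operate on activation tensors rather than Python lists, and the alignment would come from the GEMM library's requirements.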

Note: tested with DeepGemm deepseek-ai/DeepGEMM@e1c070f. Some newer revisions result in wrong answers.

cc @mgoin, @ElizaWszola, @LucasWilkinson

github-actions · 2025-02-26T23:11:02Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify · 2025-02-28T04:24:48Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-03-10T18:19:54Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-03-26T01:21:02Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

youkaichao

how large is the wheel? do we want to ship it by default?

bnellnm · 2025-03-26T01:47:32Z

> how large is the wheel? do we want to ship it by default?

The DeepGemm repo is around 200 MB. Most of that comes from the CUTLASS files. I don't think there are any whl files for this repo though.

There's also nothing to build, really. All the kernels are JIT-compiled at runtime, so DeepGemm only needs the CUTLASS sources.
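Since the PR adds DeepGemm as an option rather than a hard requirement, one common way to handle such a dependency is to check for it at runtime and fall back otherwise. The sketch below shows that general pattern with the standard library only. It is not how vLLM implements the check; `has_module` and `choose_moe_backend` are hypothetical names, and the `deep_gemm` module name is an assumption taken from the commit titles in this PR.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module `name` is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_moe_backend(want_deep_gemm: bool, module_name: str = "deep_gemm") -> str:
    """Pick a backend label: the optional one only if requested AND installed."""
    if want_deep_gemm and has_module(module_name):
        return "deep_gemm"
    return "triton"  # default path when the optional package is absent


if __name__ == "__main__":
    print(choose_moe_backend(want_deep_gemm=True))
```

A real check would likely also consider the GPU platform, which this sketch leaves out.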


tlrmchlsmth changed the title ~~Add test for deep gemm matmul~~ Add test for DeepGEMM contiguous layout MoE kernels Feb 27, 2025

mergify bot added the needs-rebase label Feb 28, 2025

bnellnm force-pushed the deep-gemm branch from 7da57d4 to bdfcd66 Compare March 1, 2025 00:32

mergify bot added ci/build and removed needs-rebase labels Mar 1, 2025

mergify bot added the needs-rebase label Mar 10, 2025

bnellnm marked this pull request as ready for review March 10, 2025 22:57

bnellnm requested review from tlrmchlsmth and WoosukKwon as code owners March 10, 2025 22:57

bnellnm changed the title ~~Add test for DeepGEMM contiguous layout MoE kernels~~ Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. Mar 11, 2025

bnellnm force-pushed the deep-gemm branch from cac9e58 to 8d688df Compare March 11, 2025 19:28

bnellnm requested review from mgoin and robertgshaw2-redhat as code owners March 12, 2025 04:23

huangtingwei9988 reviewed Mar 12, 2025

vllm/model_executor/layers/fused_moe/fused_moe.py (outdated)

bnellnm force-pushed the deep-gemm branch from 1250e3f to 33c4c3e Compare March 17, 2025 16:15

bnellnm mentioned this pull request Mar 18, 2025

permute/unpermute kernel for moe optimization #14568

Merged

bnellnm force-pushed the deep-gemm branch from b9ae89a to d546938 Compare March 25, 2025 15:42

mergify bot removed the needs-rebase label Mar 25, 2025

simon-mo added this to DeepSeek V3/R1 Mar 25, 2025

github-project-automation bot moved this to Backlog in DeepSeek V3/R1 Mar 25, 2025

youkaichao reviewed Mar 26, 2025

vllm/model_executor/layers/fused_moe/fused_moe.py (outdated)

mergify bot added the needs-rebase label Mar 26, 2025

youkaichao reviewed Mar 26, 2025

LucasWilkinson reviewed Mar 26, 2025

vllm/model_executor/layers/fused_moe/fused_moe.py (outdated)

LucasWilkinson reviewed Mar 26, 2025

vllm/model_executor/layers/fused_moe/fused_moe.py (outdated)

bnellnm added 17 commits April 1, 2025 01:30
(each commit is Signed-off-by: Bill Nell <bnell@redhat.com>)

ee9b9ef review comments
82e2696 clean up use_dg flags
83f82c2 remove check for aligned M
0448d43 rebase + clean up test
a222d94 fix format
ece3fda fix comment
d1fb8f5 refactor deep gemm implementation into separate function
61e843f format + test cleanups
7348f2e fix benchmark_moe.py
7619979 add expert_map check + assert
36bf32f fix assert
840560b add some platform checks for deep gemm usage
e7f3a3f move platform check inside function
686a5ab make linter happy
a238488 redo deep_gemm import stuff
da01930 whitespace
ee213d9 ping

auto-merge was automatically disabled April 1, 2025 01:30
Head branch was pushed to by a user without write access

bnellnm force-pushed the deep-gemm branch from 8b6bddc to ee213d9 Compare April 1, 2025 01:30

tlrmchlsmth merged commit e59ca94 into vllm-project:main Apr 1, 2025
40 checks passed

github-project-automation bot moved this from In progress to Done in DeepSeek V3/R1 Apr 1, 2025

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed

yizhang2077 mentioned this pull request Apr 17, 2025

[Model] Adding Qwen3 and Qwen3MoE sgl-project/sglang#4693

Merged

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025

Add option to use DeepGemm contiguous grouped gemm kernel for fused M…

33d67b8

…oE operations. (vllm-project#13932) Signed-off-by: Bill Nell <bnell@redhat.com>

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

Add option to use DeepGemm contiguous grouped gemm kernel for fused M…

6d2216a

…oE operations. (vllm-project#13932) Signed-off-by: Bill Nell <bnell@redhat.com>
