bnellnm (Contributor) commented Feb 26, 2025

Add the option to use DeepGemm's grouped gemm kernel for fused_moe operations.

Benchmark setup

python3 benchmarks/kernels/benchmark_moe.py --dtype=fp8_w8a8 -tp ${TP} --model deepseek-ai/DeepSeek-V3 --trust-remote-code --use-deep-gemm
Note: deepseek-ai/DeepSeek-R1 has similar problem sizes
E=256
N=4096 @ TP=1, 2048 @ TP=2, 1024 @ TP=4, 512 @ TP=8
K=7168
topk=8
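The problem sizes above follow a simple pattern: N halves each time TP doubles, while E, K, and topk stay fixed. A minimal sketch of that arithmetic, assuming N is split evenly across tensor-parallel ranks (the helper names `n_per_rank` and `routed_rows` are hypothetical and not part of the benchmark script):

```python
# Illustrative arithmetic only; the constants are copied from the setup above.
E, K, TOPK = 256, 7168, 8
N_TP1 = 4096  # N at TP=1


def n_per_rank(tp: int) -> int:
    """Per-rank N, assuming N is sharded evenly across TP ranks."""
    if N_TP1 % tp != 0:
        raise ValueError(f"N={N_TP1} is not divisible by TP={tp}")
    return N_TP1 // tp


def routed_rows(batch: int) -> int:
    """Number of (token, expert) pairs when each token picks TOPK experts."""
    return batch * TOPK


shard_sizes = {tp: n_per_rank(tp) for tp in (1, 2, 4, 8)}
print(shard_sizes)       # {1: 4096, 2: 2048, 4: 1024, 8: 512}
print(routed_rows(128))  # 1024
```

At TP=8 each rank works on N=512, the smallest per-expert problem in the benchmark.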

H100 results

| Batch | DeepGemm TP=1 | Triton TP=1 | DeepGemm TP=2 | Triton TP=2 | DeepGemm TP=4 | Triton TP=4 | DeepGemm TP=8 | Triton TP=8 |
|---|---|---|---|---|---|---|---|---|
| 128 | 6355.22 us | 12480.49 us | 3745.24 us | 6025.39 us | 2529.50 us | 3121.16 us | 1936.17 us | 1732.33 us |
| 256 | 6599.51 us | 12782.98 us | 3917.15 us | 6164.44 us | 2669.84 us | 3236.88 us | 2051.67 us | 1728.73 us |
| 512 | 7017.80 us | 13017.54 us | 4270.21 us | 6631.64 us | 3000.99 us | 3433.29 us | 2331.06 us | 1810.36 us |
| 1024 | 7850.04 us | 13322.31 us | 4874.43 us | 6800.41 us | 3575.18 us | 3546.83 us | 2886.08 us | 1892.75 us |
| 1536 | 8691.74 us | 13777.98 us | 5559.64 us | 7037.34 us | 4111.26 us | 3677.90 us | 3423.49 us | 1988.70 us |
| 2048 | 9545.87 us | 19839.38 us | 6217.40 us | 10111.40 us | 4699.03 us | 5230.76 us | 3974.65 us | 2818.65 us |
| 3072 | 11252.75 us | 27189.87 us | 7609.45 us | 13845.57 us | 5916.31 us | 7172.64 us | 5093.12 us | 3858.15 us |
| 4096 | 12843.01 us | 33834.67 us | 8669.07 us | 17239.74 us | 7167.47 us | 8927.87 us | 6225.71 us | 4789.98 us |

H200 results

| Batch | DeepGemm TP=1 | Triton TP=1 | DeepGemm TP=2 | Triton TP=2 | DeepGemm TP=4 | Triton TP=4 | DeepGemm TP=8 | Triton TP=8 |
|---|---|---|---|---|---|---|---|---|
| 128 | 5206.92 us | 10821.89 us | 3205.37 us | 5596.44 us | 2257.24 us | 2710.88 us | 1824.80 us | 1472.32 us |
| 256 | 5415.36 us | 11069.98 us | 3373.36 us | 5587.06 us | 2394.34 us | 2785.30 us | 1952.46 us | 1528.53 us |
| 512 | 5825.33 us | 13446.07 us | 3711.86 us | 6843.68 us | 2732.21 us | 3568.38 us | 2229.71 us | 1876.63 us |
| 1024 | 6599.35 us | 13754.83 us | 4362.23 us | 7016.18 us | 3287.25 us | 3681.09 us | 2754.67 us | 1946.25 us |
| 1536 | 7387.39 us | 14218.68 us | 5024.80 us | 7249.93 us | 3851.06 us | 3812.64 us | 3308.97 us | 2049.49 us |
| 2048 | 8196.78 us | 20468.26 us | 5683.06 us | 10424.77 us | 4432.50 us | 5384.37 us | 3836.07 us | 2898.91 us |
| 3072 | 9842.72 us | 28046.94 us | 7005.50 us | 14262.27 us | 5623.91 us | 7397.70 us | 4915.28 us | 3985.38 us |
| 4096 | 11375.09 us | 34898.63 us | 8171.68 us | 17769.66 us | 6829.79 us | 9232.80 us | 6017.09 us | 4942.76 us |
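To make the trend in the results easier to read, the timings can be turned into speedup ratios. The sketch below copies a handful of entries from the H100 and H200 results above and computes Triton time divided by DeepGemm time, so a value above 1 means DeepGemm is faster. The dictionary layout and the `speedup` helper are illustrative choices, not part of the benchmark script.

```python
# (DeepGemm us, Triton us) for a few (GPU, batch, TP) entries from the results above.
timings = {
    ("H100", 128, 1): (6355.22, 12480.49),
    ("H100", 4096, 1): (12843.01, 33834.67),
    ("H100", 128, 8): (1936.17, 1732.33),
    ("H100", 4096, 8): (6225.71, 4789.98),
    ("H200", 128, 1): (5206.92, 10821.89),
    ("H200", 4096, 8): (6017.09, 4942.76),
}


def speedup(deepgemm_us: float, triton_us: float) -> float:
    """Triton time over DeepGemm time; > 1.0 means DeepGemm is faster."""
    return triton_us / deepgemm_us


speedups = {key: speedup(*pair) for key, pair in timings.items()}
for (gpu, batch, tp), ratio in speedups.items():
    print(f"{gpu} batch={batch} TP={tp}: {ratio:.2f}x")
```

For these entries the ratio is roughly 2x or better at TP=1 and falls below 1 at TP=8, where Triton is faster.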

I've run server benchmarks using DeepSeek-R1, but right now the Triton and DeepGemm kernels are at parity because the DeepGemm kernels are slower than Triton at TP=8 (see the TP=8 columns above). I think this can be improved with permute/unpermute ops (similar to #14568) and/or by using expert parallelism (EP) instead of tensor parallelism (TP) for fused_moe ops.
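For readers unfamiliar with the permute/unpermute idea, here is a minimal pure-Python sketch of the general technique: reorder the routed rows so that rows for the same expert are contiguous, run the grouped computation, then restore the original order. This is not the code in this PR or in #14568. The function names are hypothetical, and real kernels operate on tensors and typically also pad each expert's group to an alignment boundary, which is omitted here.

```python
def permute_by_expert(expert_ids):
    """Return (order, inverse): `order` groups rows by expert id (stable sort),
    and `inverse` maps each original row index to its grouped position."""
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    inverse = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        inverse[old_pos] = new_pos
    return order, inverse


def apply_order(rows, order):
    """Gather rows according to an index list."""
    return [rows[i] for i in order]


# Six routed rows, each assigned to one of three experts.
expert_ids = [2, 0, 1, 0, 2, 1]
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]

order, inverse = permute_by_expert(expert_ids)
grouped = apply_order(rows, order)         # rows for expert 0, then 1, then 2
restored = apply_order(grouped, inverse)   # back in the original order

print(grouped)   # ['r1', 'r3', 'r2', 'r5', 'r0', 'r4']
print(restored)  # ['r0', 'r1', 'r2', 'r3', 'r4', 'r5']
```

The grouped layout is what lets a contiguous grouped GEMM process each expert's rows as one block. The extra gather before and after is the overhead that dedicated permute/unpermute ops aim to reduce.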

Note: tested with DeepGemm at commit deepseek-ai/DeepGEMM@e1c070f. Some newer revisions produce wrong answers.

cc @mgoin, @ElizaWszola, @LucasWilkinson

tlrmchlsmth changed the title from "Add test for deep gemm matmul" to "Add test for DeepGEMM contiguous layout MoE kernels" on Feb 27, 2025

mergify bot commented Feb 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Mar 10, 2025
bnellnm marked this pull request as ready for review March 10, 2025 22:57
bnellnm changed the title from "Add test for DeepGEMM contiguous layout MoE kernels" to "Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations." on Mar 11, 2025
mergify bot added the needs-rebase label Mar 26, 2025

youkaichao (Member) left a comment

how large is the wheel? do we want to ship it by default?

bnellnm (Contributor, Author) commented Mar 26, 2025

> how large is the wheel? do we want to ship it by default?

The DeepGemm repo is about 200 MB, most of which comes from the cutlass files. I don't think there are any whl files for this repo, though.

There also isn't really anything to build. All the kernels are JIT-compiled at runtime, so DeepGemm only needs the cutlass sources.

bnellnm added 17 commits April 1, 2025 01:30
Signed-off-by: Bill Nell <bnell@redhat.com>
auto-merge was automatically disabled April 1, 2025 01:30

Head branch was pushed to by a user without write access

tlrmchlsmth merged commit e59ca94 into vllm-project:main Apr 1, 2025
40 checks passed
github-project-automation bot moved this from In progress to Done in DeepSeek V3/R1 Apr 1, 2025
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
…oE operations. (vllm-project#13932)

Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
…oE operations. (vllm-project#13932)

Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
…oE operations. (vllm-project#13932)

Signed-off-by: Bill Nell <bnell@redhat.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
…oE operations. (vllm-project#13932)

Signed-off-by: Bill Nell <bnell@redhat.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
…oE operations. (vllm-project#13932)

Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>