
Conversation

@ElizaWszola (Contributor) commented Jun 23, 2025

Add support for blocked FP8 CUTLASS MoE for SM90.

Testing:

Single grouped multiply unit tests:

pytest tests/kernels/quantization/test_cutlass_scaled_mm.py -k test_cutlass_fp8_group_gemm

Fused experts op unit tests:

pytest tests/kernels/moe/test_cutlass_moe.py -k test_blocked_cutlass_moe_8

LLM offline inference test:

llm = LLM(model="Qwen/Qwen3-30B-A3B-FP8")
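
For context, a minimal end-to-end offline inference script around that one-liner might look like the following (the prompt and sampling settings are arbitrary illustrations, not part of the actual test):

from vllm import LLM, SamplingParams

# Load the blocked-FP8 MoE checkpoint used for the offline inference test.
llm = LLM(model="Qwen/Qwen3-30B-A3B-FP8")
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)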

Performance:

Currently, blocked CUTLASS is slightly slower than Triton on average, but faster for some shapes:

vLLM CUTLASS Implementation:
+------+-------+-------+----+------+-----------+--------+-------+--------------+
| m    | n     | k     | e  | topk | Time (μs) | TFLOPS | GB/s  | vs Triton    |
+------+-------+-------+----+------+-----------+--------+-------+--------------+
|   64 | 24576 |  1536 |  8 |    1 | 423.5     | 11.4   | 96.8  | 1.00x slower |
|   64 | 24576 |  1536 |  8 |    6 | 528.9     | 9.1    | 77.5  | 0.98x slower |
|   64 | 24576 |  1536 |  8 |    8 | 566.3     | 8.5    | 72.4  | 0.97x slower |
|   64 | 32768 |   512 |  8 |    1 | 307.2     | 7.0    | 68.4  | 1.04x faster |
|   64 | 32768 |   512 |  8 |    6 | 419.6     | 5.1    | 50.1  | 1.00x slower |
|   64 | 32768 |   512 |  8 |    8 | 469.4     | 4.6    | 44.7  | 0.96x slower |
|   64 |  7168 | 16384 |  8 |    1 | 1061.5    | 14.2   | 112.5 | 0.95x slower |
|   64 |  7168 | 16384 |  8 |    6 | 1142.3    | 13.2   | 104.5 | 0.95x slower |
|   64 |  7168 | 16384 |  8 |    8 | 1174.9    | 12.8   | 101.6 | 0.96x slower |
|   64 |  4096 |  7168 |  8 |    1 | 338.8     | 11.1   | 89.6  | 0.90x slower |
|   64 |  4096 |  7168 |  8 |    6 | 393.6     | 9.5    | 77.1  | 0.83x slower |
|   64 |  4096 |  7168 |  8 |    8 | 411.9     | 9.1    | 73.7  | 0.83x slower |
|   64 |  7168 |  2048 |  8 |    1 | 284.2     | 6.6    | 55.3  | 1.06x faster |
|   64 |  7168 |  2048 |  8 |    6 | 286.2     | 6.6    | 55.0  | 1.07x faster |
|   64 |  7168 |  2048 |  8 |    8 | 286.6     | 6.6    | 54.9  | 1.07x faster |
|  128 | 24576 |  1536 |  8 |    1 | 449.6     | 21.5   | 98.4  | 0.98x slower |
|  128 | 24576 |  1536 |  8 |    6 | 703.3     | 13.7   | 62.9  | 0.97x slower |
|  128 | 24576 |  1536 |  8 |    8 | 801.2     | 12.1   | 55.2  | 0.92x slower |
|  128 | 32768 |   512 |  8 |    1 | 326.0     | 13.2   | 77.4  | 1.03x faster |
|  128 | 32768 |   512 |  8 |    6 | 574.8     | 7.5    | 43.9  | 1.03x faster |
|  128 | 32768 |   512 |  8 |    8 | 688.3     | 6.2    | 36.7  | 0.95x slower |
|  128 |  7168 | 16384 |  8 |    1 | 1089.1    | 27.6   | 111.4 | 0.95x slower |
|  128 |  7168 | 16384 |  8 |    6 | 1300.5    | 23.1   | 93.3  | 1.14x faster |
|  128 |  7168 | 16384 |  8 |    8 | 1385.6    | 21.7   | 87.6  | 1.12x faster |
|  128 |  4096 |  7168 |  8 |    1 | 350.5     | 21.4   | 89.4  | 0.87x slower |
|  128 |  4096 |  7168 |  8 |    6 | 475.8     | 15.8   | 65.8  | 0.94x slower |
|  128 |  4096 |  7168 |  8 |    8 | 513.0     | 14.7   | 61.1  | 0.91x slower |
|  128 |  7168 |  2048 |  8 |    1 | 285.4     | 13.2   | 58.8  | 1.05x faster |
|  128 |  7168 |  2048 |  8 |    6 | 325.1     | 11.6   | 51.6  | 0.93x slower |
|  128 |  7168 |  2048 |  8 |    8 | 363.3     | 10.3   | 46.2  | 0.84x slower |
| 4096 | 24576 |  1536 |  8 |    1 | 2810.2    | 110.0  | 87.3  | 0.93x slower |
| 4096 | 24576 |  1536 |  8 |    6 | 16116.9   | 19.2   | 15.2  | 0.87x slower |
| 4096 | 24576 |  1536 |  8 |    8 | 21638.2   | 14.3   | 11.3  | 0.85x slower |
| 4096 | 32768 |   512 |  8 |    1 | 2321.9    | 59.2   | 123.7 | 0.96x slower |
| 4096 | 32768 |   512 |  8 |    6 | 12800.4   | 10.7   | 22.4  | 0.91x slower |
| 4096 | 32768 |   512 |  8 |    8 | 17160.9   | 8.0    | 16.7  | 0.89x slower |
| 4096 |  7168 | 16384 |  8 |    1 | 5572.2    | 172.7  | 43.7  | 1.08x faster |
| 4096 |  7168 | 16384 |  8 |    6 | 31637.9   | 30.4   | 7.7   | 0.97x slower |
| 4096 |  7168 | 16384 |  8 |    8 | 42752.4   | 22.5   | 5.7   | 0.94x slower |
| 4096 |  4096 |  7168 |  8 |    1 | 2096.5    | 114.7  | 44.0  | 0.84x slower |
| 4096 |  4096 |  7168 |  8 |    6 | 10337.4   | 23.3   | 8.9   | 0.83x slower |
| 4096 |  4096 |  7168 |  8 |    8 | 13602.7   | 17.7   | 6.8   | 0.82x slower |
| 4096 |  7168 |  2048 |  8 |    1 | 1317.6    | 91.3   | 62.1  | 0.77x slower |
| 4096 |  7168 |  2048 |  8 |    6 | 6860.7    | 17.5   | 11.9  | 0.75x slower |
| 4096 |  7168 |  2048 |  8 |    8 | 9054.1    | 13.3   | 9.0   | 0.75x slower |
|   64 | 24576 |  1536 | 40 |    1 | 1362.0    | 3.5    | 30.1  | 0.99x slower |
|   64 | 24576 |  1536 | 40 |    6 | 1729.0    | 2.8    | 23.7  | 0.98x slower |
|   64 | 24576 |  1536 | 40 |    8 | 1766.0    | 2.7    | 23.2  | 0.97x slower |
|   64 | 32768 |   512 | 40 |    1 | 626.7     | 3.4    | 33.5  | 1.00x slower |
|   64 | 32768 |   512 | 40 |    6 | 965.4     | 2.2    | 21.8  | 0.95x slower |
|   64 | 32768 |   512 | 40 |    8 | 1000.3    | 2.1    | 21.0  | 0.95x slower |
|   64 |  7168 | 16384 | 40 |    1 | 3914.2    | 3.8    | 30.5  | 0.98x slower |
|   64 |  7168 | 16384 | 40 |    6 | 4851.7    | 3.1    | 24.6  | 0.97x slower |
|   64 |  7168 | 16384 | 40 |    8 | 4897.6    | 3.1    | 24.4  | 0.97x slower |
|   64 |  4096 |  7168 | 40 |    1 | 987.6     | 3.8    | 30.7  | 0.97x slower |
|   64 |  4096 |  7168 | 40 |    6 | 1302.7    | 2.9    | 23.3  | 0.96x slower |
|   64 |  4096 |  7168 | 40 |    8 | 1324.3    | 2.8    | 22.9  | 0.96x slower |
|   64 |  7168 |  2048 | 40 |    1 | 559.7     | 3.4    | 28.1  | 0.94x slower |
|   64 |  7168 |  2048 | 40 |    6 | 722.0     | 2.6    | 21.8  | 0.95x slower |
|   64 |  7168 |  2048 | 40 |    8 | 734.4     | 2.6    | 21.4  | 0.95x slower |
|  128 | 24576 |  1536 | 40 |    1 | 1582.9    | 6.1    | 27.9  | 0.97x slower |
|  128 | 24576 |  1536 | 40 |    6 | 1851.1    | 5.2    | 23.9  | 0.97x slower |
|  128 | 24576 |  1536 | 40 |    8 | 1915.8    | 5.0    | 23.1  | 0.96x slower |
|  128 | 32768 |   512 | 40 |    1 | 878.6     | 4.9    | 28.7  | 0.93x slower |
|  128 | 32768 |   512 | 40 |    6 | 1090.0    | 3.9    | 23.1  | 0.95x slower |
|  128 | 32768 |   512 | 40 |    8 | 1163.8    | 3.7    | 21.7  | 0.94x slower |
|  128 |  7168 | 16384 | 40 |    1 | 4414.5    | 6.8    | 27.5  | 0.97x slower |
|  128 |  7168 | 16384 | 40 |    6 | 4992.0    | 6.0    | 24.3  | 0.97x slower |
|  128 |  7168 | 16384 | 40 |    8 | 5087.7    | 5.9    | 23.9  | 0.97x slower |
|  128 |  4096 |  7168 | 40 |    1 | 1159.1    | 6.5    | 27.0  | 0.96x slower |
|  128 |  4096 |  7168 | 40 |    6 | 1375.5    | 5.5    | 22.8  | 0.95x slower |
|  128 |  4096 |  7168 | 40 |    8 | 1406.9    | 5.3    | 22.3  | 0.94x slower |
|  128 |  7168 |  2048 | 40 |    1 | 674.2     | 5.6    | 24.9  | 0.96x slower |
|  128 |  7168 |  2048 | 40 |    6 | 771.0     | 4.9    | 21.8  | 0.95x slower |
|  128 |  7168 |  2048 | 40 |    8 | 793.8     | 4.7    | 21.1  | 0.94x slower |
| 4096 | 24576 |  1536 | 40 |    1 | 2795.6    | 110.6  | 87.8  | 1.16x faster |
| 4096 | 24576 |  1536 | 40 |    6 | 13104.1   | 23.6   | 18.7  | 1.13x faster |
| 4096 | 24576 |  1536 | 40 |    8 | 17768.9   | 17.4   | 13.8  | 1.09x faster |
| 4096 | 32768 |   512 | 40 |    1 | 2201.8    | 62.4   | 130.5 | 1.13x faster |
| 4096 | 32768 |   512 | 40 |    6 | 10419.0   | 13.2   | 27.6  | 1.18x faster |
| 4096 | 32768 |   512 | 40 |    8 | 13843.7   | 9.9    | 20.8  | 1.16x faster |
| 4096 |  7168 | 16384 | 40 |    1 | 6469.8    | 148.7  | 37.6  | 1.24x faster |
| 4096 |  7168 | 16384 | 40 |    6 | 28696.3   | 33.5   | 8.5   | 1.14x faster |
| 4096 |  7168 | 16384 | 40 |    8 | 39607.4   | 24.3   | 6.1   | 1.07x faster |
| 4096 |  4096 |  7168 | 40 |    1 | 2063.4    | 116.6  | 44.7  | 1.10x faster |
| 4096 |  4096 |  7168 | 40 |    6 | 8889.7    | 27.1   | 10.4  | 1.03x faster |
| 4096 |  4096 |  7168 | 40 |    8 | 12188.7   | 19.7   | 7.6   | 0.98x slower |
| 4096 |  7168 |  2048 | 40 |    1 | 1189.2    | 101.1  | 68.8  | 1.06x faster |
| 4096 |  7168 |  2048 | 40 |    6 | 5581.2    | 21.5   | 14.7  | 0.98x slower |
| 4096 |  7168 |  2048 | 40 |    8 | 7335.9    | 16.4   | 11.1  | 0.96x slower |
+------+-------+-------+----+------+-----------+--------+-------+--------------+

===== AVERAGE PERFORMANCE =====
+----------------+------------+----------+---------------+
| Implementation | Avg TFLOPS | Avg GB/s | Avg Time (ms) |
+----------------+------------+----------+---------------+
| vLLM Triton    | 21.98      | 44.54    | 4.84          |
| vLLM CUTLASS   | 21.83      | 43.38    | 4.91          |
+----------------+------------+----------+---------------+

===== AVERAGE SPEEDUPS =====
+-----------------------------+--------------+
| Comparison                  | Speedup      |
+-----------------------------+--------------+
| vLLM CUTLASS vs vLLM Triton | 0.97x slower |
+-----------------------------+--------------+

Future PRs should focus on speeding up the functions that preprocess data before the MoE kernels run.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Jun 23, 2025
@gemini-code-assist (bot, Contributor) left a comment

Summary of Changes

Hello @ElizaWszola, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances vLLM's quantization capabilities by integrating blocked FP8 CUTLASS MoE support for SM90 GPUs. This addition provides a more optimized path for running Mixture of Experts models with fine-grained FP8 quantization, improving efficiency and potentially performance on compatible hardware.

Highlights

  • New Feature: Blocked FP8 CUTLASS MoE: This PR introduces support for blocked FP8 quantization within the CUTLASS Mixture of Experts (MoE) implementation, specifically targeting SM90 (NVIDIA Hopper) architectures. This enables more efficient execution of MoE layers with block-wise FP8 weights and activations.
  • Core Kernel Implementation: A new CUDA kernel, grouped_mm_c3x_with_blockwise_scaling.cu, has been added. This kernel leverages CUTLASS 3.x to perform grouped matrix multiplications with block-wise FP8 scaling, which is crucial for the new quantization scheme.
  • API and Integration: New C++ functions (cutlass_moe_blockwise_mm, transpose_cutlass_moe_a_scales) and their Python bindings have been introduced to expose the blocked MoE functionality. The vllm FP8 quantization handler is updated to utilize this new CUTLASS implementation when supported by the hardware.
  • Comprehensive Testing: Extensive unit tests have been added to validate the new blocked FP8 MoE implementation, including tests for single grouped multiply operations and fused experts operations, ensuring correctness and performance.

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This PR adds support for blocked FP8 CUTLASS MoE for SM90 architectures. The changes are comprehensive. My review focuses on correctness and maintainability. I've found a critical compilation issue due to a typo in a macro definition. Additionally, there are several areas with commented-out code, debug statements, and redundant checks that should be cleaned up to improve code quality before merging. The new tests are comprehensive and cover the new functionality well.

Comment on lines 120 to 121
TORCH_CHECK(a_tensors.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn);
Contributor (severity: medium):

These TORCH_CHECKs are redundant. The same checks are performed on lines 115-118 with more descriptive error messages. Remove these duplicate lines.

Comment on lines 136 to 155
// if (n >= 8192) {
// cutlass_blockwise_group_gemm_caller<Cutlass3xGemmN8192>(
// out_tensors, a_tensors, b_tensors, a_scales, b_scales,
// expert_offsets, problem_sizes, a_strides, b_strides, c_strides,
// per_act_block);
// } else if (k >= 8192) {
// cutlass_blockwise_group_gemm_caller<Cutlass3xGemmK8192>(
// out_tensors, a_tensors, b_tensors, a_scales, b_scales,
// expert_offsets, problem_sizes, a_strides, b_strides, c_strides,
// per_act_block);
// } else if (m <= 16) {
// cutlass_blockwise_group_gemm_caller<Cutlass3xGemmM16>(
// out_tensors, a_tensors, b_tensors, a_scales, b_scales,
// expert_offsets, problem_sizes, a_strides, b_strides, c_strides,
// per_act_block,);
// } else {
cutlass_blockwise_group_gemm_caller<Cutlass3xGemmDefault>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_block);
// }
Contributor (severity: medium):

This block of commented-out code appears to be an alternative dispatch logic based on problem dimensions (M, N, K). Since it's not currently used, it should be removed to keep the code clean.

@mergify mergify bot added the performance label Jun 23, 2025
@ElizaWszola ElizaWszola marked this pull request as ready for review June 24, 2025 06:58
@mgoin mgoin changed the title Blocked fp8 CUTLASS MoE Blocked FP8 CUTLASS MoE for Hopper Jun 24, 2025
@djmmoss (Contributor) commented Jun 24, 2025

@ElizaWszola If you want to get your PR in first, I don't mind following up with the SM100 PR #19757, modifying csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_with_blockwise_scaling*, etc. with the Blackwell-related changes. As far as I can tell, this will also address #19757 (comment) and your comments, correct?

@tlrmchlsmth (Collaborator) left a comment

Let's tune these kernels and see how they measure up to DeepGEMM!

torch::Tensor const& b_scales, torch::Tensor const& expert_offsets,
torch::Tensor const& problem_sizes, torch::Tensor const& a_strides,
torch::Tensor const& b_strides, torch::Tensor const& c_strides,
bool per_act_block);
Collaborator:

What does per_act_block mean? What does it mean when it's true, and what does it mean when it's false? Please add some documentation.

@ElizaWszola (Contributor, PR author) replied Jun 30, 2025:

When it's true, the input scales are per [1x128] block. When it's false, we use per-tensor scales. I'm not sure if this is or will be needed; I can delete it if it's too much extra code.
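
To make the two modes concrete, here is a small shape-only sketch (illustrative; the tensor names and sizes are made up, and the 128 group size follows the [1x128] description above):

import torch

num_tokens, k = 64, 1536
a = torch.randn(num_tokens, k, dtype=torch.bfloat16)

# per_act_block=True: one scale per [1 x 128] block of activations,
# i.e. a [num_tokens, k // 128] scale tensor.
blocked_scales = torch.empty(num_tokens, k // 128, dtype=torch.float32)

# per_act_block=False: a single per-tensor scale for the activations.
per_tensor_scale = torch.empty((), dtype=torch.float32)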

Collaborator:

I think it's good to support both cases. But please add some comments documenting what the variable means.

Contributor (PR author):

I've added relevant comments in torch_bindings.cpp and cutlass_moe.py.

Comment on lines 124 to 129
if per_act_block:
    a1q_scale_t = torch.empty((a1q_scale.shape[0] * a1q_scale.shape[1]),
                              device=device,
                              dtype=a1q_scale.dtype)
    ops.transpose_cutlass_moe_a_scales(a1q_scale_t, a1q_scale,
                                       expert_offsets, problem_sizes1)
Collaborator:

Have you run the pytorch profiler to see how much time this takes?

Would the code be cleaner if ops.transpose_cutlass_moe_a_scales returned a1q_scale_t rather than requiring the caller to allocate it with torch.empty?

Contributor (PR author):

I've profiled the function with a few different sizes, and both transpose functions together take only slightly more time than the quantization of intermediate results (a2q, a2q_scale = ops.scaled_fp8_quant(...)). So for large enough inputs, this is an order of magnitude less than the kernel runtimes.
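
For illustration, the reviewer's suggestion could look roughly like the following thin wrapper (hypothetical helper name; it assumes the op is exposed as ops.transpose_cutlass_moe_a_scales via vllm._custom_ops, as in the quoted snippet):

import torch
import vllm._custom_ops as ops  # assumed import path for the binding

def _transpose_moe_a_scales(a1q_scale: torch.Tensor,
                            expert_offsets: torch.Tensor,
                            problem_sizes: torch.Tensor) -> torch.Tensor:
    # Allocate the flattened output here instead of at every call site.
    a1q_scale_t = torch.empty(a1q_scale.shape[0] * a1q_scale.shape[1],
                              device=a1q_scale.device,
                              dtype=a1q_scale.dtype)
    ops.transpose_cutlass_moe_a_scales(a1q_scale_t, a1q_scale,
                                       expert_offsets, problem_sizes)
    return a1q_scale_t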


mergify bot commented Jun 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot removed the needs-rebase label Jun 25, 2025
@ElizaWszola (Contributor, PR author):

> as far as I can tell this will also address #19757 (comment) and your comments correct?

@djmmoss Yes, this should address them. This PR might land a bit late though - I still have to do a bit of benchmarking and possibly add a bunch of kernel configs for performance

@mergify mergify bot removed the needs-rebase label Jul 22, 2025

mergify bot commented Jul 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 22, 2025
@mergify mergify bot removed the needs-rebase label Jul 28, 2025
@@ -637,13 +654,14 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()

cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
-  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
+  if(VLLM_COMPILE_FP8_BLOCKWISE_CUTLASS_MOE AND ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
Collaborator:

Do we want to guard the existing SM100 kernels here too?

Contributor (PR author):

This is where I'm waiting for input from the SM100 kernel's author. I e2e-benchmarked CUTLASS vs. Triton on an SM100 machine and CUTLASS was slower, but I would like them to confirm that CUTLASS is also slower than Triton with their setup.


mergify bot commented Jul 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 29, 2025
@yewentao256 (Collaborator) left a comment

Thanks for the work!

@@ -598,6 +598,55 @@ def get_w8a8_block_fp8_configs(N: int, K: int, block_n: int,
    return None


# Copied and adapted from
# https://github.com/deepseek-ai/DeepGEMM/blob/78cacf70d41d15d688bd493ebc85845f7f2a3d5d/tests/test_core.py#L17
def per_block_cast_to_fp8(
Collaborator:

We can reuse the other per_block_cast_to_fp8, similar to #21787.

    assert x.dim() == 2
    m, n = x.shape

    def ceil_div(x: int, y: int) -> int:
Collaborator:

Please use cdiv from vllm.utils.
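
For reference, the suggested replacement would look roughly like this (assuming the cdiv helper in vllm.utils; the sizes below are illustrative):

from vllm.utils import cdiv

n, k, block_n, block_k = 24576, 1536, 128, 128  # illustrative shapes

n_blocks = cdiv(n, block_n)  # replaces the local ceil_div helper
k_blocks = cdiv(k, block_k)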

Comment on lines +625 to +647
def native_per_token_group_quant_fp8(x,
                                     group_size,
                                     eps=1e-10,
                                     dtype=torch.float8_e4m3fn):
    """Function to perform per-token-group quantization on an input tensor
    `x` using native torch."""
    assert x.shape[-1] % group_size == 0, ("the last dimension of `x` cannot "
                                           "be divisible by `group_size`")
    assert x.is_contiguous(), "`x` is not contiguous"

    finfo = torch.finfo(dtype)
    fp8_min = finfo.min
    fp8_max = finfo.max

    x_ = x.reshape(x.numel() // group_size, group_size)
    amax = x_.abs().max(dim=-1,
                        keepdim=True)[0].clamp(min=eps).to(torch.float32)
    x_s = amax / fp8_max
    x_q = (x_ / x_s).clamp(min=fp8_min, max=fp8_max).to(dtype)
    x_q = x_q.reshape(x.shape)
    x_s = x_s.reshape(x.shape[:-1] + (x.shape[-1] // group_size, ))

    return x_q, x_s
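
A quick usage sketch of the helper quoted above (shapes are arbitrary; the last dimension just needs to be divisible by group_size):

import torch

x = torch.randn(4, 256, dtype=torch.bfloat16)
x_q, x_s = native_per_token_group_quant_fp8(x, group_size=128)
# x_q: [4, 256] float8_e4m3fn values, x_s: [4, 2] float32 scales
# (one scale per 128-wide group of each token).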
Collaborator:

Why do we need this instead of the CUDA per-token-group quant kernel?

Comment on lines 194 to 212
                               const float* a_scales,
                               const int32_t* expert_offsets,
                               const int32_t* problem_sizes,
                               int64_t k_scaled) {
  int64_t expert_idx = blockIdx.x;
  int64_t start_k_scaled = threadIdx.x;
  int64_t step_k_scaled = blockDim.x;
  int64_t expert_offset = expert_offsets[expert_idx];
  int64_t num_tokens = problem_sizes[expert_idx * 3];
  int64_t expert_offset_scaled = expert_offset * k_scaled;

  for (int64_t t = 0; t < num_tokens; ++t) {
    for (int64_t k = start_k_scaled; k < k_scaled; k += step_k_scaled) {
      a_scales_t[expert_offset_scaled + k * num_tokens + t] =
          a_scales[expert_offset_scaled + t * k_scaled + k];
    }
  }
}
Collaborator:

A pure scalar loop would be slow.
Some thoughts on how we could optimize this:

  1. vectorization using vectorize_with_alignment
  2. shared memory tile
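
For readability, here is a torch-level reference of what the quoted kernel computes: within each expert's slice of the flat scale buffer, a [num_tokens, k_scaled] block is transposed to [k_scaled, num_tokens]. This is only a sketch to document the semantics, not the optimized version discussed above, and it assumes a_scales is a flat 1-D tensor indexed exactly as in the kernel:

import torch

def transpose_a_scales_reference(a_scales: torch.Tensor,
                                 expert_offsets: torch.Tensor,
                                 problem_sizes: torch.Tensor,
                                 k_scaled: int) -> torch.Tensor:
    a_scales_t = torch.empty_like(a_scales)
    sizes = problem_sizes.view(-1)       # 3 ints per expert, as in the kernel
    num_experts = sizes.numel() // 3
    for e in range(num_experts):
        offset = int(expert_offsets[e]) * k_scaled
        num_tokens = int(sizes[e * 3])
        block = a_scales[offset:offset + num_tokens * k_scaled]
        block = block.view(num_tokens, k_scaled)
        # [t, k] -> [k, t]: matches a_scales_t[offset + k * num_tokens + t]
        a_scales_t[offset:offset + num_tokens * k_scaled] = (
            block.t().contiguous().view(-1))
    return a_scales_t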

Comment on lines 120 to 122

-  // Swap-AB should be disabled for FP4 path
-  bool may_swap_ab = (!blockscale_offsets.has_value()) &&
-                     (topk_ids.numel() <= SWAP_AB_THRESHOLD);
-
-  if (may_swap_ab) {
+  bool swap_ab = !force_no_swap && topk_ids.numel() <= SWAP_AB_THRESHOLD;
+  if (swap_ab) {
Collaborator:

If we are no longer using blockscale_offsets.has_value(), we should make sure that the NVFP4 path passes in force_no_swap.

using ElementScale = typename Gemm::ElementScale;
using ScaleConfig = typename Gemm::ScaleConfig;
using LayoutSFA = typename Gemm::LayoutSFA;
using LayoutSFB = typename Gemm::LayoutSFB;
Contributor:

Maybe setting the correct LayoutSFA could avoid the transpose of a_scales in cutlass_moe.py?

Contributor:

Many thanks! Will look into this tomorrow.

# Get the right scale for tests.
if per_act_block:
    a_q, a_scale = per_token_group_quant_fp8(moe_tensors_fp16.a,
                                             block_size[1])
Contributor:

Transpose here?

@mergify mergify bot removed the needs-rebase label Aug 21, 2025
ElizaWszola and others added 4 commits August 21, 2025 12:22
@elvircrn (Contributor) commented Aug 21, 2025

These are the latest benchmark numbers on H100:

vLLM Triton Implementation:
+------+-------+-------+----+------+-----------+--------+-------+
| m    | n     | k     | e  | topk | Time (μs) | TFLOPS | GB/s  |
+------+-------+-------+----+------+-----------+--------+-------+
|   64 | 24576 |  1536 |  8 |    1 | 407.3     | 11.9   | 100.6 |
|   64 | 24576 |  1536 |  8 |    6 | 471.4     | 10.2   | 87.0  |
|   64 | 24576 |  1536 |  8 |    8 | 494.7     | 9.8    | 82.9  |
|   64 | 32768 |   512 |  8 |    1 | 302.7     | 7.1    | 69.4  |
|   64 | 32768 |   512 |  8 |    6 | 363.1     | 5.9    | 57.8  |
|   64 | 32768 |   512 |  8 |    8 | 382.8     | 5.6    | 54.9  |
|   64 |  7168 | 16384 |  8 |    1 | 995.6     | 15.1   | 119.9 |
|   64 |  7168 | 16384 |  8 |    6 | 1043.1    | 14.4   | 114.5 |
|   64 |  7168 | 16384 |  8 |    8 | 1070.1    | 14.0   | 111.6 |
|   64 |  4096 |  7168 |  8 |    1 | 282.6     | 13.3   | 107.4 |
|   64 |  4096 |  7168 |  8 |    6 | 299.8     | 12.5   | 101.2 |
|   64 |  4096 |  7168 |  8 |    8 | 308.5     | 12.2   | 98.4  |
|   64 |  7168 |  2048 |  8 |    1 | 262.3     | 7.2    | 60.0  |
|   64 |  7168 |  2048 |  8 |    6 | 248.1     | 7.6    | 63.4  |
|   64 |  7168 |  2048 |  8 |    8 | 236.4     | 7.9    | 66.5  |
|  128 | 24576 |  1536 |  8 |    1 | 417.3     | 23.2   | 106.0 |
|  128 | 24576 |  1536 |  8 |    6 | 621.1     | 15.6   | 71.2  |
|  128 | 24576 |  1536 |  8 |    8 | 660.1     | 14.6   | 67.0  |
|  128 | 32768 |   512 |  8 |    1 | 313.6     | 13.7   | 80.5  |
|  128 | 32768 |   512 |  8 |    6 | 494.6     | 8.7    | 51.0  |
|  128 | 32768 |   512 |  8 |    8 | 533.1     | 8.1    | 47.3  |
|  128 |  7168 | 16384 |  8 |    1 | 1011.2    | 29.7   | 120.0 |
|  128 |  7168 | 16384 |  8 |    6 | 1428.4    | 21.0   | 85.0  |
|  128 |  7168 | 16384 |  8 |    8 | 1487.3    | 20.2   | 81.6  |
|  128 |  4096 |  7168 |  8 |    1 | 289.9     | 25.9   | 108.1 |
|  128 |  4096 |  7168 |  8 |    6 | 410.5     | 18.3   | 76.3  |
|  128 |  4096 |  7168 |  8 |    8 | 420.8     | 17.9   | 74.4  |
|  128 |  7168 |  2048 |  8 |    1 | 258.6     | 14.5   | 64.9  |
|  128 |  7168 |  2048 |  8 |    6 | 245.1     | 15.3   | 68.5  |
|  128 |  7168 |  2048 |  8 |    8 | 253.2     | 14.8   | 66.3  |
| 4096 | 24576 |  1536 |  8 |    1 | 2359.2    | 131.1  | 104.0 |
| 4096 | 24576 |  1536 |  8 |    6 | 12536.7   | 24.7   | 19.6  |
| 4096 | 24576 |  1536 |  8 |    8 | 16447.7   | 18.8   | 14.9  |
| 4096 | 32768 |   512 |  8 |    1 | 1773.8    | 77.5   | 162.0 |
| 4096 | 32768 |   512 |  8 |    6 | 9062.2    | 15.2   | 31.7  |
| 4096 | 32768 |   512 |  8 |    8 | 11928.6   | 11.5   | 24.1  |
| 4096 |  7168 | 16384 |  8 |    1 | 5737.8    | 167.7  | 42.4  |
| 4096 |  7168 | 16384 |  8 |    6 | 29182.1   | 33.0   | 8.3   |
| 4096 |  7168 | 16384 |  8 |    8 | 37809.0   | 25.4   | 6.4   |
| 4096 |  4096 |  7168 |  8 |    1 | 1593.2    | 151.0  | 57.9  |
| 4096 |  4096 |  7168 |  8 |    6 | 7795.3    | 30.9   | 11.8  |
| 4096 |  4096 |  7168 |  8 |    8 | 10116.5   | 23.8   | 9.1   |
| 4096 |  7168 |  2048 |  8 |    1 | 899.7     | 133.7  | 90.9  |
| 4096 |  7168 |  2048 |  8 |    6 | 4598.5    | 26.2   | 17.8  |
| 4096 |  7168 |  2048 |  8 |    8 | 6032.2    | 19.9   | 13.6  |
|   64 | 24576 |  1536 | 40 |    1 | 1332.0    | 3.6    | 30.8  |
|   64 | 24576 |  1536 | 40 |    6 | 1645.2    | 2.9    | 24.9  |
|   64 | 24576 |  1536 | 40 |    8 | 1667.9    | 2.9    | 24.6  |
|   64 | 32768 |   512 | 40 |    1 | 610.8     | 3.5    | 34.4  |
|   64 | 32768 |   512 | 40 |    6 | 862.5     | 2.5    | 24.4  |
|   64 | 32768 |   512 | 40 |    8 | 880.3     | 2.4    | 23.9  |
|   64 |  7168 | 16384 | 40 |    1 | 3804.6    | 4.0    | 31.4  |
|   64 |  7168 | 16384 | 40 |    6 | 4680.1    | 3.2    | 25.5  |
|   64 |  7168 | 16384 | 40 |    8 | 4710.0    | 3.2    | 25.4  |
|   64 |  4096 |  7168 | 40 |    1 | 946.9     | 4.0    | 32.0  |
|   64 |  4096 |  7168 | 40 |    6 | 1227.4    | 3.1    | 24.7  |
|   64 |  4096 |  7168 | 40 |    8 | 1237.7    | 3.0    | 24.5  |
|   64 |  7168 |  2048 | 40 |    1 | 512.7     | 3.7    | 30.7  |
|   64 |  7168 |  2048 | 40 |    6 | 663.2     | 2.8    | 23.7  |
|   64 |  7168 |  2048 | 40 |    8 | 673.6     | 2.8    | 23.3  |
|  128 | 24576 |  1536 | 40 |    1 | 1511.3    | 6.4    | 29.3  |
|  128 | 24576 |  1536 | 40 |    6 | 1717.5    | 5.6    | 25.8  |
|  128 | 24576 |  1536 | 40 |    8 | 1752.3    | 5.5    | 25.2  |
|  128 | 32768 |   512 | 40 |    1 | 790.7     | 5.4    | 31.9  |
|  128 | 32768 |   512 | 40 |    6 | 937.4     | 4.6    | 26.9  |
|  128 | 32768 |   512 | 40 |    8 | 975.3     | 4.4    | 25.9  |
|  128 |  7168 | 16384 | 40 |    1 | 4284.4    | 7.0    | 28.3  |
|  128 |  7168 | 16384 | 40 |    6 | 4779.1    | 6.3    | 25.4  |
|  128 |  7168 | 16384 | 40 |    8 | 4820.6    | 6.2    | 25.2  |
|  128 |  4096 |  7168 | 40 |    1 | 1093.6    | 6.9    | 28.6  |
|  128 |  4096 |  7168 | 40 |    6 | 1259.0    | 6.0    | 24.9  |
|  128 |  4096 |  7168 | 40 |    8 | 1268.2    | 5.9    | 24.7  |
|  128 |  7168 |  2048 | 40 |    1 | 631.4     | 6.0    | 26.6  |
|  128 |  7168 |  2048 | 40 |    6 | 692.6     | 5.4    | 24.2  |
|  128 |  7168 |  2048 | 40 |    8 | 699.0     | 5.4    | 24.0  |
| 4096 | 24576 |  1536 | 40 |    1 | 2994.1    | 103.3  | 81.9  |
| 4096 | 24576 |  1536 | 40 |    6 | 13316.6   | 23.2   | 18.4  |
| 4096 | 24576 |  1536 | 40 |    8 | 17419.5   | 17.8   | 14.1  |
| 4096 | 32768 |   512 | 40 |    1 | 2060.1    | 66.7   | 139.5 |
| 4096 | 32768 |   512 | 40 |    6 | 9657.0    | 14.2   | 29.8  |
| 4096 | 32768 |   512 | 40 |    8 | 12646.8   | 10.9   | 22.7  |
| 4096 |  7168 | 16384 | 40 |    1 | 7596.6    | 126.6  | 32.0  |
| 4096 |  7168 | 16384 | 40 |    6 | 31141.1   | 30.9   | 7.8   |
| 4096 |  7168 | 16384 | 40 |    8 | 40627.5   | 23.7   | 6.0   |
| 4096 |  4096 |  7168 | 40 |    1 | 2087.6    | 115.2  | 44.2  |
| 4096 |  4096 |  7168 | 40 |    6 | 8375.8    | 28.7   | 11.0  |
| 4096 |  4096 |  7168 | 40 |    8 | 10947.7   | 22.0   | 8.4   |
| 4096 |  7168 |  2048 | 40 |    1 | 1142.4    | 105.3  | 71.6  |
| 4096 |  7168 |  2048 | 40 |    6 | 4892.9    | 24.6   | 16.7  |
| 4096 |  7168 |  2048 | 40 |    8 | 6343.5    | 19.0   | 12.9  |
+------+-------+-------+----+------+-----------+--------+-------+

vLLM CUTLASS Implementation:
+------+-------+-------+----+------+-----------+--------+-------+--------------+
| m    | n     | k     | e  | topk | Time (μs) | TFLOPS | GB/s  | vs Triton    |
+------+-------+-------+----+------+-----------+--------+-------+--------------+
|   64 | 24576 |  1536 |  8 |    1 | 409.8     | 11.8   | 100.0 | 0.99x slower |
|   64 | 24576 |  1536 |  8 |    6 | 458.3     | 10.5   | 89.4  | 1.03x faster |
|   64 | 24576 |  1536 |  8 |    8 | 478.5     | 10.1   | 85.7  | 1.03x faster |
|   64 | 32768 |   512 |  8 |    1 | 288.0     | 7.5    | 72.9  | 1.05x faster |
|   64 | 32768 |   512 |  8 |    6 | 336.4     | 6.4    | 62.4  | 1.08x faster |
|   64 | 32768 |   512 |  8 |    8 | 355.3     | 6.0    | 59.1  | 1.08x faster |
|   64 |  7168 | 16384 |  8 |    1 | 1041.4    | 14.4   | 114.7 | 0.96x slower |
|   64 |  7168 | 16384 |  8 |    6 | 1119.8    | 13.4   | 106.6 | 0.93x slower |
|   64 |  7168 | 16384 |  8 |    8 | 1150.9    | 13.1   | 103.7 | 0.93x slower |
|   64 |  4096 |  7168 |  8 |    1 | 323.0     | 11.6   | 93.9  | 0.87x slower |
|   64 |  4096 |  7168 |  8 |    6 | 361.5     | 10.4   | 83.9  | 0.83x slower |
|   64 |  4096 |  7168 |  8 |    8 | 378.1     | 9.9    | 80.3  | 0.82x slower |
|   64 |  7168 |  2048 |  8 |    1 | 226.0     | 8.3    | 69.6  | 1.16x faster |
|   64 |  7168 |  2048 |  8 |    6 | 221.4     | 8.5    | 71.1  | 1.12x faster |
|   64 |  7168 |  2048 |  8 |    8 | 231.1     | 8.1    | 68.1  | 1.02x faster |
|  128 | 24576 |  1536 |  8 |    1 | 417.4     | 23.1   | 106.0 | 1.00x slower |
|  128 | 24576 |  1536 |  8 |    6 | 545.7     | 17.7   | 81.1  | 1.14x faster |
|  128 | 24576 |  1536 |  8 |    8 | 591.1     | 16.3   | 74.8  | 1.12x faster |
|  128 | 32768 |   512 |  8 |    1 | 292.1     | 14.7   | 86.4  | 1.07x faster |
|  128 | 32768 |   512 |  8 |    6 | 397.7     | 10.8   | 63.4  | 1.24x faster |
|  128 | 32768 |   512 |  8 |    8 | 455.6     | 9.4    | 55.4  | 1.17x faster |
|  128 |  7168 | 16384 |  8 |    1 | 1064.5    | 28.2   | 114.0 | 0.95x slower |
|  128 |  7168 | 16384 |  8 |    6 | 1281.2    | 23.5   | 94.7  | 1.11x faster |
|  128 |  7168 | 16384 |  8 |    8 | 1366.8    | 22.0   | 88.8  | 1.09x faster |
|  128 |  4096 |  7168 |  8 |    1 | 331.2     | 22.7   | 94.6  | 0.88x slower |
|  128 |  4096 |  7168 |  8 |    6 | 419.1     | 17.9   | 74.8  | 0.98x slower |
|  128 |  4096 |  7168 |  8 |    8 | 448.9     | 16.7   | 69.8  | 0.94x slower |
|  128 |  7168 |  2048 |  8 |    1 | 221.6     | 17.0   | 75.7  | 1.17x faster |
|  128 |  7168 |  2048 |  8 |    6 | 256.4     | 14.7   | 65.4  | 0.96x slower |
|  128 |  7168 |  2048 |  8 |    8 | 276.4     | 13.6   | 60.7  | 0.92x slower |
| 4096 | 24576 |  1536 |  8 |    1 | 1942.6    | 159.2  | 126.3 | 1.21x faster |
| 4096 | 24576 |  1536 |  8 |    6 | 11088.5   | 27.9   | 22.1  | 1.13x faster |
| 4096 | 24576 |  1536 |  8 |    8 | 14689.3   | 21.1   | 16.7  | 1.12x faster |
| 4096 | 32768 |   512 |  8 |    1 | 1407.2    | 97.7   | 204.2 | 1.26x faster |
| 4096 | 32768 |   512 |  8 |    6 | 7827.6    | 17.6   | 36.7  | 1.16x faster |
| 4096 | 32768 |   512 |  8 |    8 | 10348.8   | 13.3   | 27.8  | 1.15x faster |
| 4096 |  7168 | 16384 |  8 |    1 | 5325.6    | 180.6  | 45.7  | 1.08x faster |
| 4096 |  7168 | 16384 |  8 |    6 | 30734.1   | 31.3   | 7.9   | 0.95x slower |
| 4096 |  7168 | 16384 |  8 |    8 | 41547.3   | 23.2   | 5.9   | 0.91x slower |
| 4096 |  4096 |  7168 |  8 |    1 | 1714.9    | 140.2  | 53.8  | 0.93x slower |
| 4096 |  4096 |  7168 |  8 |    6 | 8447.6    | 28.5   | 10.9  | 0.92x slower |
| 4096 |  4096 |  7168 |  8 |    8 | 11113.3   | 21.6   | 8.3   | 0.91x slower |
| 4096 |  7168 |  2048 |  8 |    1 | 881.9     | 136.4  | 92.7  | 1.02x faster |
| 4096 |  7168 |  2048 |  8 |    6 | 4457.6    | 27.0   | 18.3  | 1.03x faster |
| 4096 |  7168 |  2048 |  8 |    8 | 5798.6    | 20.7   | 14.1  | 1.04x faster |
|   64 | 24576 |  1536 | 40 |    1 | 1339.7    | 3.6    | 30.6  | 0.99x slower |
|   64 | 24576 |  1536 | 40 |    6 | 1687.4    | 2.9    | 24.3  | 0.97x slower |
|   64 | 24576 |  1536 | 40 |    8 | 1718.6    | 2.8    | 23.9  | 0.97x slower |
|   64 | 32768 |   512 | 40 |    1 | 612.2     | 3.5    | 34.3  | 1.00x slower |
|   64 | 32768 |   512 | 40 |    6 | 922.3     | 2.3    | 22.8  | 0.94x slower |
|   64 | 32768 |   512 | 40 |    8 | 940.0     | 2.3    | 22.3  | 0.94x slower |
|   64 |  7168 | 16384 | 40 |    1 | 3890.4    | 3.9    | 30.7  | 0.98x slower |
|   64 |  7168 | 16384 | 40 |    6 | 4829.6    | 3.1    | 24.7  | 0.97x slower |
|   64 |  7168 | 16384 | 40 |    8 | 4896.4    | 3.1    | 24.4  | 0.96x slower |
|   64 |  4096 |  7168 | 40 |    1 | 981.4     | 3.8    | 30.9  | 0.96x slower |
|   64 |  4096 |  7168 | 40 |    6 | 1292.4    | 2.9    | 23.5  | 0.95x slower |
|   64 |  4096 |  7168 | 40 |    8 | 1317.8    | 2.9    | 23.0  | 0.94x slower |
|   64 |  7168 |  2048 | 40 |    1 | 557.3     | 3.4    | 28.2  | 0.92x slower |
|   64 |  7168 |  2048 | 40 |    6 | 699.1     | 2.7    | 22.5  | 0.95x slower |
|   64 |  7168 |  2048 | 40 |    8 | 714.1     | 2.6    | 22.0  | 0.94x slower |
|  128 | 24576 |  1536 | 40 |    1 | 1559.5    | 6.2    | 28.4  | 0.97x slower |
|  128 | 24576 |  1536 | 40 |    6 | 1769.9    | 5.5    | 25.0  | 0.97x slower |
|  128 | 24576 |  1536 | 40 |    8 | 1819.7    | 5.3    | 24.3  | 0.96x slower |
|  128 | 32768 |   512 | 40 |    1 | 854.1     | 5.0    | 29.5  | 0.93x slower |
|  128 | 32768 |   512 | 40 |    6 | 984.4     | 4.4    | 25.6  | 0.95x slower |
|  128 | 32768 |   512 | 40 |    8 | 1037.2    | 4.1    | 24.3  | 0.94x slower |
|  128 |  7168 | 16384 | 40 |    1 | 4405.5    | 6.8    | 27.6  | 0.97x slower |
|  128 |  7168 | 16384 | 40 |    6 | 5001.8    | 6.0    | 24.3  | 0.96x slower |
|  128 |  7168 | 16384 | 40 |    8 | 5106.4    | 5.9    | 23.8  | 0.94x slower |
|  128 |  4096 |  7168 | 40 |    1 | 1144.6    | 6.6    | 27.4  | 0.96x slower |
|  128 |  4096 |  7168 | 40 |    6 | 1354.9    | 5.5    | 23.1  | 0.93x slower |
|  128 |  4096 |  7168 | 40 |    8 | 1396.2    | 5.4    | 22.4  | 0.91x slower |
|  128 |  7168 |  2048 | 40 |    1 | 661.4     | 5.7    | 25.4  | 0.95x slower |
|  128 |  7168 |  2048 | 40 |    6 | 740.2     | 5.1    | 22.7  | 0.94x slower |
|  128 |  7168 |  2048 | 40 |    8 | 758.5     | 5.0    | 22.1  | 0.92x slower |
| 4096 | 24576 |  1536 | 40 |    1 | 2485.9    | 124.4  | 98.7  | 1.20x faster |
| 4096 | 24576 |  1536 | 40 |    6 | 11420.4   | 27.1   | 21.5  | 1.17x faster |
| 4096 | 24576 |  1536 | 40 |    8 | 15214.7   | 20.3   | 16.1  | 1.14x faster |
| 4096 | 32768 |   512 | 40 |    1 | 1739.3    | 79.0   | 165.2 | 1.18x faster |
| 4096 | 32768 |   512 | 40 |    6 | 8030.8    | 17.1   | 35.8  | 1.20x faster |
| 4096 | 32768 |   512 | 40 |    8 | 10683.7   | 12.9   | 26.9  | 1.18x faster |
| 4096 |  7168 | 16384 | 40 |    1 | 6444.2    | 149.3  | 37.7  | 1.18x faster |
| 4096 |  7168 | 16384 | 40 |    6 | 29620.2   | 32.5   | 8.2   | 1.05x faster |
| 4096 |  7168 | 16384 | 40 |    8 | 40327.5   | 23.9   | 6.0   | 1.01x faster |
| 4096 |  4096 |  7168 | 40 |    1 | 1987.1    | 121.0  | 46.4  | 1.05x faster |
| 4096 |  4096 |  7168 | 40 |    6 | 8926.0    | 26.9   | 10.3  | 0.94x slower |
| 4096 |  4096 |  7168 | 40 |    8 | 12302.4   | 19.6   | 7.5   | 0.89x slower |
| 4096 |  7168 |  2048 | 40 |    1 | 1049.9    | 114.5  | 77.9  | 1.09x faster |
| 4096 |  7168 |  2048 | 40 |    6 | 4952.8    | 24.3   | 16.5  | 0.99x slower |
| 4096 |  7168 |  2048 | 40 |    8 | 6488.8    | 18.5   | 12.6  | 0.98x slower |
+------+-------+-------+----+------+-----------+--------+-------+--------------+

===== AVERAGE PERFORMANCE =====
+----------------+------------+----------+---------------+
| Implementation | Avg TFLOPS | Avg GB/s | Avg Time (ms) |
+----------------+------------+----------+---------------+
| vLLM Triton    | 24.15      | 48.85    | 4.42          |
| vLLM CUTLASS   | 25.60      | 50.11    | 4.33          |
+----------------+------------+----------+---------------+

===== AVERAGE SPEEDUPS =====
+-----------------------------+--------------+
| Comparison                  | Speedup      |
+-----------------------------+--------------+
| vLLM CUTLASS vs vLLM Triton | 1.01x faster |
+-----------------------------+--------------+

===== ACCURACY COMPARISON =====
+----------------+-----------------------+
| Implementation | Avg Diff vs Reference |
+----------------+-----------------------+
| vLLM Triton    | 0.000294              |
| vLLM CUTLASS   | 0.000295              |
+----------------+-----------------------+


mergify bot commented Aug 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 23, 2025
@mergify mergify bot removed the needs-rebase label Aug 26, 2025

mergify bot commented Aug 30, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 30, 2025
Labels: ci/build, needs-rebase, performance (Performance-related issues)