[Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper #19983
base: main
Conversation
Summary of Changes
Hello @ElizaWszola, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances vLLM's quantization capabilities by integrating blocked FP8 CUTLASS MoE support for SM90 GPUs. This addition provides a more optimized path for running Mixture of Experts models with fine-grained FP8 quantization, improving efficiency and potentially performance on compatible hardware.
Highlights
- New Feature: Blocked FP8 CUTLASS MoE: This PR introduces support for blocked FP8 quantization within the CUTLASS Mixture of Experts (MoE) implementation, specifically targeting SM90 (NVIDIA Hopper) architectures. This enables more efficient execution of MoE layers with block-wise FP8 weights and activations.
- Core Kernel Implementation: A new CUDA kernel, `grouped_mm_c3x_with_blockwise_scaling.cu`, has been added. This kernel leverages CUTLASS 3.x to perform grouped matrix multiplications with block-wise FP8 scaling, which is crucial for the new quantization scheme.
- API and Integration: New C++ functions (`cutlass_moe_blockwise_mm`, `transpose_cutlass_moe_a_scales`) and their Python bindings have been introduced to expose the blocked MoE functionality. The vLLM FP8 quantization handler is updated to utilize this new CUTLASS implementation when supported by the hardware.
- Comprehensive Testing: Extensive unit tests have been added to validate the new blocked FP8 MoE implementation, including tests for single grouped multiply operations and fused experts operations, ensuring correctness and performance.
Code Review
This PR adds support for blocked FP8 CUTLASS MoE for SM90 architectures. The changes are comprehensive. My review focuses on correctness and maintainability. I've found a critical compilation issue due to a typo in a macro definition. Additionally, there are several areas with commented-out code, debug statements, and redundant checks that should be cleaned up to improve code quality before merging. The new tests are comprehensive and cover the new functionality well.
csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_with_blockwise_scaling.cu
```cpp
TORCH_CHECK(a_tensors.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn);
```
```cpp
  // if (n >= 8192) {
  //   cutlass_blockwise_group_gemm_caller<Cutlass3xGemmN8192>(
  //       out_tensors, a_tensors, b_tensors, a_scales, b_scales,
  //       expert_offsets, problem_sizes, a_strides, b_strides, c_strides,
  //       per_act_block);
  // } else if (k >= 8192) {
  //   cutlass_blockwise_group_gemm_caller<Cutlass3xGemmK8192>(
  //       out_tensors, a_tensors, b_tensors, a_scales, b_scales,
  //       expert_offsets, problem_sizes, a_strides, b_strides, c_strides,
  //       per_act_block);
  // } else if (m <= 16) {
  //   cutlass_blockwise_group_gemm_caller<Cutlass3xGemmM16>(
  //       out_tensors, a_tensors, b_tensors, a_scales, b_scales,
  //       expert_offsets, problem_sizes, a_strides, b_strides, c_strides,
  //       per_act_block,);
  // } else {
  cutlass_blockwise_group_gemm_caller<Cutlass3xGemmDefault>(
      out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
      problem_sizes, a_strides, b_strides, c_strides, per_act_block);
  // }
```
@ElizaWszola If you want to get your PR in first, I don't mind following up with the SM100 PR #19757 but modifying …
Let's tune these kernels and see how they measure up to DeepGEMM!
```cpp
    torch::Tensor const& b_scales, torch::Tensor const& expert_offsets,
    torch::Tensor const& problem_sizes, torch::Tensor const& a_strides,
    torch::Tensor const& b_strides, torch::Tensor const& c_strides,
    bool per_act_block);
```
What does `per_act_block` mean? What does it mean if it's true and what does it mean if it's false? Please add some documentation.
When it's true, this means [1x128]-block input scales. When it's false, we use per tensor scales - not sure if this is or will be needed, I can delete if it's too much extra code.
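For illustration, a minimal sketch (not from the PR; shapes are placeholders) of the two activation-scale layouts that `per_act_block` selects between:

```python
import torch

num_tokens, hidden_size, group = 4, 512, 128
a = torch.randn(num_tokens, hidden_size)  # activations to be quantized

# per_act_block=True: fine-grained [1x128] activation scales, i.e. one scale
# per token per 128-wide block along the hidden dimension.
blockwise_scales = torch.ones(num_tokens, hidden_size // group,
                              dtype=torch.float32)  # shape (4, 4)

# per_act_block=False: a single per-tensor activation scale.
per_tensor_scale = torch.ones(1, dtype=torch.float32)
```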
I think it's good to support both cases. But please add some comments documenting what the variable means.
I've added relevant comments in `torch_bindings.cpp` and `cutlass_moe.py`.
```python
if per_act_block:
    a1q_scale_t = torch.empty((a1q_scale.shape[0] * a1q_scale.shape[1]),
                              device=device,
                              dtype=a1q_scale.dtype)
    ops.transpose_cutlass_moe_a_scales(a1q_scale_t, a1q_scale,
                                       expert_offsets, problem_sizes1)
```
Have you run the PyTorch profiler to see how much time this takes? Would the code be cleaner if `ops.transpose_cutlass_moe_a_scales` returned `a1q_scale_t` rather than requiring the caller to allocate it with `torch.empty`?
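For example, a wrapper along these lines (a sketch only; the wrapper name is hypothetical, the op and argument order follow the snippet above, and the import path is assumed to be vLLM's usual custom-ops module):

```python
import torch
from vllm import _custom_ops as ops  # assumed import path for the new binding


def transpose_a_scales(a1q_scale: torch.Tensor,
                       expert_offsets: torch.Tensor,
                       problem_sizes: torch.Tensor) -> torch.Tensor:
    # Allocate the flattened, transposed scale buffer internally and return it,
    # so callers no longer need the explicit torch.empty.
    a1q_scale_t = torch.empty(a1q_scale.shape[0] * a1q_scale.shape[1],
                              device=a1q_scale.device,
                              dtype=a1q_scale.dtype)
    ops.transpose_cutlass_moe_a_scales(a1q_scale_t, a1q_scale,
                                       expert_offsets, problem_sizes)
    return a1q_scale_t
```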
I've profiled the function with a few different sizes, and both transpose functions together take only slightly more time than the quantization of intermediate results (`a2q, a2q_scale = ops.scaled_fp8_quant(...)`). So for large enough inputs, this is an order of magnitude less than the kernel runtimes.
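For reference, a minimal, generic sketch of how such a timing comparison can be made with the PyTorch profiler (the op being timed here is just a stand-in, not the actual transpose or quantization kernel):

```python
import torch
from torch.profiler import ProfilerActivity, profile


def op_under_test(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the op being measured (e.g. a scale transpose).
    return x.t().contiguous()


x = torch.randn(4096, 56, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        op_under_test(x)
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```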
@djmmoss Yes, this should address them. This PR might land a bit late though - I still have to do a bit of benchmarking and possibly add a bunch of kernel configs for performance.
```diff
@@ -637,13 +654,14 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   endif()

   cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
-  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
+  if(VLLM_COMPILE_FP8_BLOCKWISE_CUTLASS_MOE AND ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
```
Do we want to guard against the existing SM100 kernels here too?
This is where I'm waiting for input from the SM100 kernel's author. I e2e-benchmarked CUTLASS vs. Triton on an SM100 machine and CUTLASS was slower, but I would like them to confirm that CUTLASS is also slower than Triton with their setup.
Thanks for the work!
```diff
@@ -598,6 +598,55 @@ def get_w8a8_block_fp8_configs(N: int, K: int, block_n: int,
     return None


+# Copied and adapted from
+# https://github.com/deepseek-ai/DeepGEMM/blob/78cacf70d41d15d688bd493ebc85845f7f2a3d5d/tests/test_core.py#L17
+def per_block_cast_to_fp8(
```
We can reuse the existing `per_block_cast_to_fp8`, similar to #21787.
```python
assert x.dim() == 2
m, n = x.shape

def ceil_div(x: int, y: int) -> int:
```
Please use `cdiv` from vllm utils.
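For reference, a rough sketch of what such a shared helper could look like with `cdiv` from `vllm.utils` in place of the local `ceil_div` (this approximates the DeepGEMM-style helper linked above; the actual shared implementation referenced in #21787 may differ):

```python
import torch
from vllm.utils import cdiv


def per_block_cast_to_fp8_sketch(
        x: torch.Tensor,
        block: tuple[int, int] = (128, 128)) -> tuple[torch.Tensor, torch.Tensor]:
    assert x.dim() == 2
    m, n = x.shape
    bm, bn = block
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # Zero-pad so that (bm x bn) blocks tile the tensor exactly.
    padded = torch.zeros(cdiv(m, bm) * bm, cdiv(n, bn) * bn,
                         dtype=x.dtype, device=x.device)
    padded[:m, :n] = x
    # View as (blocks_m, bm, blocks_n, bn) and take a per-block absolute max.
    blocks = padded.view(padded.size(0) // bm, bm, padded.size(1) // bn, bn)
    amax = blocks.abs().float().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4)
    # Quantize each block with its own scale, then crop back to the input shape.
    x_q = (blocks * (fp8_max / amax)).to(torch.float8_e4m3fn)
    x_q = x_q.view_as(padded)[:m, :n].contiguous()
    scales = (amax / fp8_max).squeeze(3).squeeze(1)  # (blocks_m, blocks_n)
    return x_q, scales
```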
```python
def native_per_token_group_quant_fp8(x,
                                     group_size,
                                     eps=1e-10,
                                     dtype=torch.float8_e4m3fn):
    """Function to perform per-token-group quantization on an input tensor
    `x` using native torch."""
    assert x.shape[-1] % group_size == 0, ("the last dimension of `x` must "
                                           "be divisible by `group_size`")
    assert x.is_contiguous(), "`x` is not contiguous"

    finfo = torch.finfo(dtype)
    fp8_min = finfo.min
    fp8_max = finfo.max

    x_ = x.reshape(x.numel() // group_size, group_size)
    amax = x_.abs().max(dim=-1,
                        keepdim=True)[0].clamp(min=eps).to(torch.float32)
    x_s = amax / fp8_max
    x_q = (x_ / x_s).clamp(min=fp8_min, max=fp8_max).to(dtype)
    x_q = x_q.reshape(x.shape)
    x_s = x_s.reshape(x.shape[:-1] + (x.shape[-1] // group_size, ))

    return x_q, x_s
```
Why do we need this instead of the CUDA per-token-group quant kernel?
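If the native version is only meant as a test reference, a small sketch of how it could be cross-checked against the kernel-backed helper already used in this PR's tests (the import path, the scale layout returned by the kernel helper, and the tolerances below are assumptions):

```python
import torch
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
    per_token_group_quant_fp8)  # assumed import path

x = torch.randn(64, 1024, dtype=torch.bfloat16, device="cuda")
group_size = 128

# Reference path: the native torch implementation above.
x_q_ref, x_s_ref = native_per_token_group_quant_fp8(x, group_size)
# Kernel-backed path used elsewhere in this PR's tests.
x_q_ker, x_s_ker = per_token_group_quant_fp8(x, group_size)

# Compare dequantized values; loose tolerances absorb fp8 rounding differences.
deq_ref = x_q_ref.float().reshape(64, -1, group_size) * x_s_ref.unsqueeze(-1)
deq_ker = x_q_ker.float().reshape(64, -1, group_size) * x_s_ker.unsqueeze(-1)
torch.testing.assert_close(deq_ref, deq_ker, atol=0.15, rtol=0.15)
```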
```cpp
                              const float* a_scales,
                              const int32_t* expert_offsets,
                              const int32_t* problem_sizes,
                              int64_t k_scaled) {
  int64_t expert_idx = blockIdx.x;
  int64_t start_k_scaled = threadIdx.x;
  int64_t step_k_scaled = blockDim.x;
  int64_t expert_offset = expert_offsets[expert_idx];
  int64_t num_tokens = problem_sizes[expert_idx * 3];
  int64_t expert_offset_scaled = expert_offset * k_scaled;

  for (int64_t t = 0; t < num_tokens; ++t) {
    for (int64_t k = start_k_scaled; k < k_scaled; k += step_k_scaled) {
      a_scales_t[expert_offset_scaled + k * num_tokens + t] =
          a_scales[expert_offset_scaled + t * k_scaled + k];
    }
  }
}
```
A pure scalar loop would be slow here. Some thoughts on how we could optimize this:
- vectorization using `vectorize_with_alignment`
- a shared memory tile
```diff
   // Swap-AB should be disabled for FP4 path
-  bool may_swap_ab = (!blockscale_offsets.has_value()) &&
-                     (topk_ids.numel() <= SWAP_AB_THRESHOLD);
-
-  if (may_swap_ab) {
+  bool swap_ab = !force_no_swap && topk_ids.numel() <= SWAP_AB_THRESHOLD;
+  if (swap_ab) {
```
If we are no longer using `blockscale_offsets.has_value()`, we should make sure that the nvfp4 path passes in `force_no_swap`.
```cpp
  using ElementScale = typename Gemm::ElementScale;
  using ScaleConfig = typename Gemm::ScaleConfig;
  using LayoutSFA = typename Gemm::LayoutSFA;
  using LayoutSFB = typename Gemm::LayoutSFB;
```
Maybe setting the correct `LayoutSFA` could avoid the transpose of `a_scales` in cutlass_moe.py?
Many thanks! Will look into this tomorrow.
```python
# Get the right scale for tests.
if per_act_block:
    a_q, a_scale = per_token_group_quant_fp8(moe_tensors_fp16.a,
                                             block_size[1])
```
Transpose here?
These are the latest benchmark numbers on H100:
Add support for blocked fp8 CUTLASS MoE for SM90.
Testing:
- Single grouped multiply unit tests:
- Fused experts op unit tests:
- LLM offline inference test:
Performance:
Currently, blocked CUTLASS is slightly worse than Triton on average, but beneficial for some shapes:
Future PRs should focus on speeding up the functions that preprocess data before running the MoE kernels.