Minor speed up block_quant_dequant #6814

fzyzcjy · 2025-06-02T13:20:04Z

Motivation

Modifications

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

gemini-code-assist

Hello @fzyzcjy, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! gemini-code-assist here, providing a summary of this pull request to help everyone get up to speed quickly.

This PR aims to significantly speed up the block_quant_dequant function within the FP8 quantization utilities. The primary motivation, based on the title, is performance optimization for this specific operation. The changes involve replacing a manual, loop-based implementation of block dequantization with a more efficient tensor-based approach leveraging the einops library.

Highlights

Performance Optimization: The core change is a rewrite of the block_quant_dequant function to replace nested Python loops and manual tiling with a single, potentially faster, tensor operation using einops.repeat.
Code Simplification: The implementation of block_quant_dequant is simplified by removing the explicit loop structure and tile handling, resulting in more concise code.
New Dependency: The einops library is added as a dependency to facilitate the tensor repetition needed for the new dequantization logic.
Input Validation: Assertions were added to block_quant_dequant to ensure that the input tensor dimensions are perfectly divisible by the specified block size, which is required for the new einops-based approach.

Changelog

python/sglang/srt/layers/quantization/fp8_utils.py
- Removed unused imports os and curses.flash.
- Added import for the einops library.
- Rewrote the block_quant_dequant function to use einops.repeat and tensor multiplication for dequantization, replacing the previous loop-based implementation.
- Added assertions to block_quant_dequant to check for divisibility of input dimensions by block size.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Loops run slow and long,
Tensor math makes code strong,
Speed up, dequantize!

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request aims to speed up the block_quant_dequant function by replacing a loop-based implementation with a vectorized approach using einops. This is a good optimization strategy and should lead to performance improvements.

The main point to clarify is the change in input validation: the new implementation asserts that input tensor dimensions must be perfectly divisible by the block sizes, whereas the previous implementation appeared to handle non-perfectly divisible inputs. Please confirm if this change in behavior is intentional. If so, the function's docstring should be updated to reflect this new precondition.

Summary of Findings

Change in Input Validation for block_quant_dequant: The function block_quant_dequant now explicitly asserts that the input tensor dimensions n and k must be perfectly divisible by block_n and block_k respectively. The previous implementation appeared to handle non-perfectly divisible inputs. This change needs clarification to ensure it's intentional and to update documentation accordingly if it is.
Unused Imports Removed: The imports os and curses.flash were removed. This is a good cleanup as they were unused in the context of the changed function. (Severity: low, not commented due to review settings)

Merge Readiness

This pull request introduces a good performance optimization by vectorizing the block_quant_dequant function. However, there's a significant change in input validation that needs clarification. If the new assertions for input divisibility are intended and all callers adhere to this, the change is acceptable, but the docstring should be updated. If the previous behavior of handling non-divisible inputs needs to be maintained, the current changes would be breaking.

I recommend addressing the point about input validation before merging. I am unable to approve pull requests, so please ensure other reviewers approve this change after the clarification.

gemini-code-assist · 2025-06-02T13:21:28Z

python/sglang/srt/layers/quantization/fp8_utils.py

+    assert n % block_n == 0
+    assert k % block_k == 0


The new implementation introduces assertions n % block_n == 0 and k % block_k == 0. This is a stricter requirement compared to the previous version, which used min() in slicing (e.g., min((j + 1) * block_n, n)) and could handle cases where x_q_block dimensions were not perfectly divisible by block_size (as long as x_s had the correct number of scales for the tiles).

Could you clarify if this change in behavior is intentional?

If x_q_block is always guaranteed to have dimensions divisible by block_size in all call sites, then these assertions make an implicit precondition explicit, which is good. In this case, it would be beneficial to update the function's docstring (lines 374-378) to mention these new preconditions for x_q_block and x_s shapes.

If the function was previously expected to handle non-divisible shapes, this change might break existing callers or alter the function's contract. If the old behavior needs to be preserved, the einops approach might require padding or a different vectorization strategy.

This change is important as it affects the function's input contract.

Alcanderian · 2025-06-02T14:34:37Z

NO einops for multi-platform compatibility XD

fzyzcjy · 2025-06-02T14:38:10Z

@Alcanderian Ah, that one does not support AMD? I thought it is a wrapper around pytorch...

fzyzcjy · 2025-06-13T12:31:21Z

einops removed

cherry pick

bfd0cf6

fzyzcjy requested review from merrymercy, Ying1123, zhyncs, ispobock, HaiShaw, ch-wan and BBuf as code owners June 2, 2025 13:20

fzyzcjy changed the title ~~Speed up block_quant_dequant~~ Minor speed up block_quant_dequant Jun 2, 2025

gemini-code-assist bot reviewed Jun 2, 2025

View reviewed changes

gemini-code-assist bot suggested changes Jun 2, 2025

View reviewed changes

zhyncs assigned Alcanderian Jun 2, 2025

zhyncs added the high priority label Jun 2, 2025

zhyncs and others added 6 commits June 2, 2025 11:14

Merge branch 'main' into feat/fast_dequant

945f5e9

Merge branch 'main' into feat/fast_dequant

89a7d52

Merge branch 'main' into feat/fast_dequant

f3f3542

cherry pick

92fcf44

more

3ec01fb

fmt

6b2521b

fzyzcjy and others added 4 commits June 13, 2025 20:31

merge

166f50d

more

3cc1bc5

more

15f9a21

Merge branch 'main' into feat/fast_dequant

bcfc162

zhyncs approved these changes Jun 13, 2025

View reviewed changes

zhyncs merged commit e3ec6bf into sgl-project:main Jun 13, 2025
3 of 48 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Minor speed up block_quant_dequant #6814

Minor speed up block_quant_dequant #6814

Uh oh!

fzyzcjy commented Jun 2, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jun 2, 2025

Uh oh!

Alcanderian commented Jun 2, 2025

Uh oh!

fzyzcjy commented Jun 2, 2025

Uh oh!

fzyzcjy commented Jun 13, 2025

Uh oh!

Uh oh!

Uh oh!

Minor speed up block_quant_dequant #6814

Minor speed up block_quant_dequant #6814

Uh oh!

Conversation

fzyzcjy commented Jun 2, 2025

Motivation

Modifications

Checklist

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Changelog

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Summary of Findings

Merge Readiness

Uh oh!

gemini-code-assist bot Jun 2, 2025

Choose a reason for hiding this comment

Uh oh!

Alcanderian commented Jun 2, 2025

Uh oh!

fzyzcjy commented Jun 2, 2025

Uh oh!

fzyzcjy commented Jun 13, 2025

Uh oh!

Uh oh!

Uh oh!