Conversation

fzyzcjy
Collaborator

@fzyzcjy commented Jun 2, 2025

Motivation

Modifications

Checklist

@fzyzcjy changed the title from "Speed up block_quant_dequant" to "Minor speed up block_quant_dequant" on Jun 2, 2025
Contributor

@gemini-code-assist bot left a comment


Hello @fzyzcjy, Gemini Code Assist here, providing a summary of this pull request to help you and other reviewers quickly get up to speed.

Summary of Changes

This PR aims to significantly speed up the block_quant_dequant function within the FP8 quantization utilities. The primary motivation, based on the title, is performance optimization for this specific operation. The changes involve replacing a manual, loop-based implementation of block dequantization with a more efficient tensor-based approach leveraging the einops library.

Highlights

  • Performance Optimization: The core change is a rewrite of the block_quant_dequant function to replace nested Python loops and manual tiling with a single, potentially faster, tensor operation using einops.repeat.
  • Code Simplification: The implementation of block_quant_dequant is simplified by removing the explicit loop structure and tile handling, resulting in more concise code.
  • New Dependency: The einops library is added as a dependency to facilitate the tensor repetition needed for the new dequantization logic.
  • Input Validation: Assertions were added to block_quant_dequant to ensure that the input tensor dimensions are perfectly divisible by the specified block size, which is required for the new einops-based approach.

Changelog

  • python/sglang/srt/layers/quantization/fp8_utils.py
    • Removed unused imports os and curses.flash.
    • Added import for the einops library.
    • Rewrote the block_quant_dequant function to use einops.repeat and tensor multiplication for dequantization, replacing the previous loop-based implementation.
    • Added assertions to block_quant_dequant to check for divisibility of input dimensions by block size.


Loops run slow and long,
Tensor math makes code strong,
Speed up, dequantize!


Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request aims to speed up the block_quant_dequant function by replacing a loop-based implementation with a vectorized approach using einops. This is a good optimization strategy and should lead to performance improvements.

The main point to clarify is the change in input validation: the new implementation asserts that input tensor dimensions must be perfectly divisible by the block sizes, whereas the previous implementation appeared to handle non-perfectly divisible inputs. Please confirm if this change in behavior is intentional. If so, the function's docstring should be updated to reflect this new precondition.

Summary of Findings

  • Change in Input Validation for block_quant_dequant: The function block_quant_dequant now explicitly asserts that the input tensor dimensions n and k must be perfectly divisible by block_n and block_k respectively. The previous implementation appeared to handle non-perfectly divisible inputs. This change needs clarification to ensure it's intentional and to update documentation accordingly if it is.
  • Unused Imports Removed: The imports os and curses.flash were removed. This is a good cleanup as they were unused in the context of the changed function. (Severity: low, not commented due to review settings)

Merge Readiness

This pull request introduces a good performance optimization by vectorizing the block_quant_dequant function. However, there's a significant change in input validation that needs clarification. If the new assertions for input divisibility are intended and all callers adhere to this, the change is acceptable, but the docstring should be updated. If the previous behavior of handling non-divisible inputs needs to be maintained, the current changes would be breaking.

I recommend addressing the point about input validation before merging. I am unable to approve pull requests, so please ensure other reviewers approve this change after the clarification.

Comment on lines 381 to 382
assert n % block_n == 0
assert k % block_k == 0
Contributor


Severity: high

The new implementation introduces assertions n % block_n == 0 and k % block_k == 0. This is a stricter requirement compared to the previous version, which used min() in slicing (e.g., min((j + 1) * block_n, n)) and could handle cases where x_q_block dimensions were not perfectly divisible by block_size (as long as x_s had the correct number of scales for the tiles).

Could you clarify if this change in behavior is intentional?

  1. If x_q_block is always guaranteed to have dimensions divisible by block_size in all call sites, then these assertions make an implicit precondition explicit, which is good. In this case, it would be beneficial to update the function's docstring (lines 374-378) to mention these new preconditions for x_q_block and x_s shapes.
  2. If the function was previously expected to handle non-divisible shapes, this change might break existing callers or alter the function's contract. If the old behavior needs to be preserved, the einops approach might require padding or a different vectorization strategy.

This change is important as it affects the function's input contract.
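For contrast, the loop-based pattern this review describes (with `min()` clamping the trailing partial tile) can be reconstructed roughly as below. This is a hedged sketch of the pre-PR behavior, not the actual removed code; the function name and scale layout are assumptions.

```python
import torch

def block_quant_dequant_loop(x_q_block, x_s, block_size):
    # Sketch of the earlier per-tile loop: min() clamps the last slice,
    # so n and k need not be divisible by the block size.
    block_n, block_k = block_size
    n, k = x_q_block.shape
    n_tiles = (n + block_n - 1) // block_n
    k_tiles = (k + block_k - 1) // block_k
    out = torch.empty((n, k), dtype=torch.float32)
    for j in range(n_tiles):
        for i in range(k_tiles):
            rows = slice(j * block_n, min((j + 1) * block_n, n))
            cols = slice(i * block_k, min((i + 1) * block_k, k))
            # Scale every element of this tile by its single per-block scale.
            out[rows, cols] = x_q_block[rows, cols].to(torch.float32) * x_s[j, i]
    return out
```

If callers really do pass non-divisible shapes, a vectorized replacement would need padding (or cropping the expanded scale tensor) rather than bare assertions.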

@Alcanderian
Collaborator

NO einops for multi-platform compatibility XD

@fzyzcjy
Collaborator Author

fzyzcjy commented Jun 2, 2025

@Alcanderian Ah, does that one not support AMD? I thought it was a wrapper around PyTorch...

@fzyzcjy
Collaborator Author

fzyzcjy commented Jun 13, 2025

einops removed
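A dependency-free way to drop einops while keeping the vectorized form is plain `torch.Tensor.repeat_interleave`. This is a sketch of what "einops removed" could amount to, not necessarily the actual commit:

```python
import torch

def block_quant_dequant(x_q_block, x_s, block_size):
    # Same tile expansion as einops.repeat("a b -> (a bn) (b bk)"),
    # expressed as two repeat_interleave calls on the scale tensor.
    block_n, block_k = block_size
    n, k = x_q_block.shape
    assert n % block_n == 0
    assert k % block_k == 0
    x_s_full = x_s.repeat_interleave(block_n, dim=0).repeat_interleave(block_k, dim=1)
    return x_q_block.to(torch.float32) * x_s_full
```

This keeps the single-multiply fast path while avoiding the extra dependency that raised the multi-platform concern.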

@zhyncs merged commit e3ec6bf into sgl-project:main on Jun 13, 2025
3 of 48 checks passed