-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Refactor DeepGEMM integration #7150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @fzyzcjy, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the integration of the DeepGEMM library into SGLang by creating a dedicated wrapper package. This change improves code organization, centralizes DeepGEMM-specific logic, and prepares the codebase for potential future updates or alternative implementations by abstracting the direct dependency.
Highlights
- DeepGEMM Wrapper: Introduced a new
deep_gemm_wrapper
package (python/sglang/srt/layers/quantization/deep_gemm_wrapper
) to encapsulate the integration logic for the DeepGEMM library. This centralizes DeepGEMM-specific code. - Code Relocation: Moved the majority of the DeepGEMM JIT compilation and utility functions from the old
python/sglang/srt/layers/quantization/deep_gemm.py
file into the newdeep_gemm_wrapper/compile_utils.py
. - Entrypoint Module: Created
deep_gemm_wrapper/entrypoint.py
to serve as the public interface for DeepGEMM functionality within SGLang, re-exporting wrapped versions of DeepGEMM functions likegrouped_gemm_nt_f8f8bf16_masked
,grouped_gemm_nt_f8f8bf16_contig
, andgemm_nt_f8f8bf16
. - Configuration Centralization: Added
deep_gemm_wrapper/configurer.py
to handle the logic for determining if DeepGEMM JIT is enabled based on environment variables and device capabilities, definingENABLE_JIT_DEEPGEMM
and version flags. - Utility Functions: Copied
align
andceil_div
utility functions, previously potentially implicitly available or imported from DeepGEMM, into a newpython/sglang/math_utils.py
file for broader use. - Usage Updates: Updated all files that previously imported directly from
deep_gemm
or the oldsglang.srt.layers.quantization.deep_gemm
to now import from the newdeep_gemm_wrapper
.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configureGemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively refactors the DeepGEMM integration by introducing a dedicated deep_gemm_wrapper
module. This change significantly improves modularity and clarity by centralizing DeepGEMM-related configurations, JIT compilation logic, and API access. The updates across various files to use this new wrapper are consistent and well-executed.
I've identified a few areas for potential improvement, mainly concerning the use of magic numbers and addressing existing TODO comments that were touched by this refactor. These suggestions aim to further enhance the code's maintainability and readability.
gateup_output, | ||
masked_m, | ||
expected_m, | ||
recipe=(1, 128, 128) if deep_gemm_wrapper.DEEPGEMM_V202506 else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recipe
tuple (1, 128, 128)
is used here and on line 1252. If this recipe is specific and optimal for this DeepGEMM version, consider defining it as a named constant at the module level or within the deep_gemm_wrapper
to improve readability and maintainability. This avoids magic numbers directly in the function call.
# TODO hard code 128 block quant,use fp8 communication | ||
hidden_states = sglang_per_token_group_quant_fp8(hidden_states, 128) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO
comment indicates that 128
is a hardcoded block quantization size. If this value is specific to the current DeepGEMM JIT setup or FP8 communication strategy, consider making it a named constant (e.g., FP8_COMM_BLOCK_QUANT_SIZE = 128
) defined in a relevant configuration spot (perhaps within deep_gemm_wrapper.configurer
or a constants module). This would improve clarity and make it easier to adjust if needed in the future.
_INITIALIZATION_DICT: Dict[Tuple[DeepGemmKernelType, int, int, int], bool] = dict() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO
suggests improving the naming of _compile_warning_1
. A more descriptive name could enhance readability. For example, _log_pre_compilation_warning
or _warn_jit_pre_compile_start
might better convey its purpose.
_INITIALIZATION_DICT: Dict[Tuple[DeepGemmKernelType, int, int, int], bool] = dict() | |
def _log_pre_compilation_warning(): |
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Motivation
Modifications
Checklist