[Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf #19830
Conversation
Summary of Changes
Hello @ProExpertProg, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors and centralizes the FP8 quantization logic within the codebase. By introducing a new `QuantFP8` abstraction, it provides a more unified, configurable, and maintainable approach to handling 8-bit floating-point operations, which are crucial for efficient model execution. The changes also include a new benchmark to validate the performance and correctness of these FP8 operations.
Highlights
- New FP8 Quantization Abstraction: Introduced a new `QuantFP8` class in `vllm/model_executor/layers/fp8_quantization.py`. This class provides a unified and configurable interface for FP8 quantization, supporting both per-token and per-tensor dynamic/static quantization, and encapsulates both the native (PyTorch) and CUDA implementations (a rough sketch of the underlying computation follows after this list).
- Refactoring and Integration: Refactored existing FP8 quantization calls across various modules (e.g., `Fp8LinearOp`, attention backends, and quantization schemes such as FBGEMM, ModelOpt, PTPC, and Quark) to use the new `QuantFP8` class. This replaces direct calls to `ops.scaled_fp8_quant` with calls to instances of `QuantFP8`, making the quantization logic more modular and consistent.
- Improved Configurability: Added parameters such as `act_quant_static` (static vs. dynamic quantization) and `act_quant_group_shape` (per-token vs. per-tensor grouping) to the `Fp8LinearOp` and `QuantFP8` classes, allowing more flexible and optimized FP8 usage based on hardware capabilities and model requirements.
- New FP8 Quantization Benchmark: Added a benchmark script (`benchmarks/kernels/bench_per_token_quant_fp8.py`) to compare the performance of per-token FP8 quantization using PyTorch's native implementation and the CUDA ops, helping verify correctness and assess performance gains.
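For readers unfamiliar with dynamic FP8 quantization, the following is a minimal sketch of the math the new class encapsulates in its native (PyTorch) path. This is illustrative only and not the actual `QuantFP8` implementation, which also supports static scales and a CUDA kernel path:

```python
import torch

# Illustrative values; the real op lives in vLLM.
FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max


def dynamic_quant_fp8(x: torch.Tensor, per_token: bool = True):
    """Quantize `x` to FP8 with a dynamically computed scale.

    per_token=True  -> one scale per row/token (shape [num_tokens, 1])
    per_token=False -> one scale for the whole tensor (shape [1, 1])
    """
    if per_token:
        amax = x.abs().amax(dim=-1, keepdim=True).float()
    else:
        amax = x.abs().amax().reshape(1, 1).float()
    scale = (amax / FP8_MAX).clamp(min=1e-12)
    x_q = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return x_q, scale


if __name__ == "__main__":
    x = torch.randn(4, 8, dtype=torch.bfloat16)
    x_q, scale = dynamic_quant_fp8(x, per_token=True)
    print(x_q.dtype, scale.shape)  # torch.float8_e4m3fn, torch.Size([4, 1])
```

Because this path is plain PyTorch, torch.compile/Inductor can trace it and fuse it with neighboring elementwise ops, which is the source of the speedups reported in this PR.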
Code Review
This pull request introduces a new `QuantFP8` class to encapsulate FP8 quantization logic and refactors existing code to use it. The changes improve code organization and prepare for more flexible quantization schemes. However, there are several high- and critical-severity issues related to the correct initialization and usage of the new `QuantFP8` class in various parts of the codebase, particularly in the attention backends and one of the compressed-tensors schemes. Additionally, the `QuantFP8` class itself has limitations in its native implementation and an assertion that might be too restrictive depending on the supported quantization types. Addressing these issues is crucial for correctness and stability.
This pull request has merge conflicts that must be resolved before it can be merged.
@gemini-code-assist review
Code Review
This pull request introduces a `CustomOp` abstraction for FP8 quantization, which is a significant improvement for performance and code structure. The refactoring is extensive, touching many files, but it is applied consistently and effectively. The new `QuantFP8` op and the updated `Fp8LinearOp` make the quantization logic cleaner and more maintainable. The performance benchmarks in the PR description are very detailed and clearly demonstrate the benefits of this change.

I have a couple of minor suggestions in the new benchmark file to improve code readability by avoiding shadowing Python built-in functions. Overall, this is a high-quality contribution.
Love to see GroupShape used more widely and just the general clean-up! And a nice perf boost. Thanks for doing this!
Signed-off-by: Luka Govedic <lgovedic@redhat.com>
…llm-project#19830) Signed-off-by: Luka Govedic <lgovedic@redhat.com> Co-authored-by: mgoin <mgoin64@gmail.com>
…llm-project#19830) Signed-off-by: Luka Govedic <lgovedic@redhat.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
…llm-project#19830) Signed-off-by: Luka Govedic <lgovedic@redhat.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
…llm-project#19830) Signed-off-by: Luka Govedic <lgovedic@redhat.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
…llm-project#19830) Signed-off-by: Luka Govedic <lgovedic@redhat.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
…llm-project#19830) Signed-off-by: Luka Govedic <lgovedic@redhat.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Purpose: vllm-project#19830 added QuantFp8, which uses the CustomOp abstraction to implement fp8 quantization in both CUDA and torch, allowing Inductor to achieve superior performance over the CUDA ops (which are unoptimized and also do not fuse by default). However, the class has to be instantiated during init, and MoE uses are currently in util free functions many levels deep. Those need to be mildly rearchitected to take advantage of the new abstraction. The main changes include: - Add a MoEInputQuantizer class with init/forward (and callable) interface for FP8/INT8/NVFP4/MXFP4 activation quantization. - Make the class abstract out details on decision of the quant fp8 op to be used Signed-off-by: Rohan Jagtap <rohanj30@icloud.com>
Purpose: vllm-project#19830 added QuantFp8, which uses the CustomOp abstraction to implement fp8 quantization in both CUDA and torch, allowing Inductor to achieve superior performance over the CUDA ops (which are unoptimized and also do not fuse by default). However, the class has to be instantiated during init, and MoE uses are currently in util free functions many levels deep. Those need to be mildly rearchitected to take advantage of the new abstraction. The changes in this commit include: - Adapt the `MoEPrepareAndFinalizeNoEP` class to the custom op wrapper changes Signed-off-by: Rohan Jagtap <rohanj30@icloud.com>
Purpose: vllm-project#19830 added QuantFp8, which uses the CustomOp abstraction to implement fp8 quantization in both CUDA and torch, allowing Inductor to achieve superior performance over the CUDA ops (which are unoptimized and also do not fuse by default). However, the class has to be instantiated during init, and MoE uses are currently in util free functions many levels deep. Those need to be mildly rearchitected to take advantage of the new abstraction. The changes in this commit include: - Refactor docstrings to use backticks for code - refactor formatting for certain methods to match other calls Signed-off-by: Rohan Jagtap <rohanj30@icloud.com>
Purpose: vllm-project#19830 added QuantFp8, which uses the CustomOp abstraction to implement fp8 quantization in both CUDA and torch, allowing Inductor to achieve superior performance over the CUDA ops (which are unoptimized and also do not fuse by default). However, the class has to be instantiated during init, and MoE uses are currently in util free functions many levels deep. Those need to be mildly rearchitected to take advantage of the new abstraction. The changes in this commit include: - Minor refactoring to keep code consistent Signed-off-by: Rohan Jagtap <rohanj30@icloud.com>
Purpose: vllm-project#19830 added QuantFp8, which uses the CustomOp abstraction to implement fp8 quantization in both CUDA and torch, allowing Inductor to achieve superior performance over the CUDA ops (which are unoptimized and also do not fuse by default). However, the class has to be instantiated during init, and MoE uses are currently in util free functions many levels deep. Those need to be mildly rearchitected to take advantage of the new abstraction. The changes in this commit include: - Fix code smell for unhandled None object for quantizer Signed-off-by: Rohan Jagtap <rohanj30@icloud.com>
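As a rough idea of the interface described in these follow-up commits, here is a hypothetical shape of such a quantizer class. This is not the actual `MoEInputQuantizer` design (which belongs to that follow-up work, not this PR); the name and signatures are stand-ins:

```python
from abc import ABC, abstractmethod

import torch


class MoEInputQuantizerSketch(ABC):
    """Hypothetical init/forward (callable) interface for MoE activation
    quantization, parameterized over the quantization format."""

    def __init__(self, quant_dtype: torch.dtype, static: bool = False):
        # Decide once, at init time, which underlying quant op to use,
        # instead of re-deciding inside deeply nested utility functions.
        self.quant_dtype = quant_dtype
        self.static = static

    @abstractmethod
    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """Return (quantized activations, scales)."""

    def __call__(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        return self.forward(x)
```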
Purpose
This PR refactors FP8 quantization kernels to use the `CustomOp` abstraction, allowing Inductor to generate fast(er) Triton kernels and automatically perform fusion with `RMSNorm` and `SiluMul` (already implemented as `CustomOp`s). This gives significant speedups, demonstrated below.

All forward-pass code for dense linear layers instantiates `QuantFP8` inside the layer's `__init__` method and calls it instead of calling `custom_ops.scaled_fp8_quant` directly (a rough sketch of this pattern is shown below). Non-forward code (e.g. weight quantization during `process_weights_after_loading`) still uses the CUDA kernel, as compilation is not worth it for a single execution. This could be changed in the future if necessary.
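For illustration, the pattern looks roughly like the following. The class and method names here are simplified stand-ins, not vLLM's actual `QuantFP8`/`Fp8LinearOp` implementations or signatures:

```python
import torch
from torch import nn

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max


class SimpleQuantFP8(nn.Module):
    """Minimal dynamic per-token FP8 quantizer (torch-native path)."""

    def forward(self, x: torch.Tensor):
        scale = (x.abs().amax(dim=-1, keepdim=True).float() / FP8_MAX).clamp(min=1e-12)
        return (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE), scale


class Fp8ActivationLinear(nn.Module):
    """The quant op is constructed once in __init__ and called in forward(),
    so torch.compile sees a plain traceable callable in the hot path rather
    than an opaque call to a CUDA op."""

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        self.quant_fp8 = SimpleQuantFP8()
        self.weight = weight  # kept in high precision here for simplicity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_q, x_scale = self.quant_fp8(x)
        # Dequantize-then-matmul for clarity; a real kernel would use an
        # FP8 GEMM (e.g. torch._scaled_mm) with the scales passed through.
        return (x_q.float() * x_scale) @ self.weight.float().t()
```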
I also moved the `GroupShape` utility from `fusion.py` to `quant_utils.py` as it's now more widely used, and fixed up the manual fusion tests (which might now have to enable the fp8 custom op); a sketch of the group-shape idea follows at the end of this section.

MoE, attention layers, and INT8 quantization can be done in the future. MoE specifically is difficult because the call to `scaled_fp8_quant` is nested in many levels of free functions, so all of those would need to become objects. Attention will require a custom compilation utility, as the call to the op is hidden from torch.compile inside the `unified_attention` custom op.
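For reference, the per-token/per-tensor group shapes mentioned above can be thought of roughly like this. This is a sketch of the idea only; the actual `GroupShape` definition now lives in `quant_utils.py` and may differ:

```python
from typing import NamedTuple

import torch


class GroupShapeSketch(NamedTuple):
    """(rows, cols) covered by one scale; -1 means 'the whole dimension'."""
    row: int
    col: int


PER_TENSOR = GroupShapeSketch(-1, -1)  # one scale for the entire tensor
PER_TOKEN = GroupShapeSketch(1, -1)    # one scale per row/token


def scale_shape(x: torch.Tensor, group: GroupShapeSketch) -> tuple[int, int]:
    """Shape of the scale tensor for a 2-D activation under `group`."""
    rows = 1 if group.row == -1 else x.shape[0] // group.row
    cols = 1 if group.col == -1 else x.shape[1] // group.col
    return rows, cols


x = torch.randn(16, 4096)
assert scale_shape(x, PER_TOKEN) == (16, 1)
assert scale_shape(x, PER_TENSOR) == (1, 1)
```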
Test Plan
Running lm_eval on static per-tensor and dynamic per-token manually, and CI.
For performance, I ran a serving sweep for dynamic per-token and static per-tensor, and a detailed latency sweep for dynamic per-token.
Test Result
For dynamic per-token quantization, I ran a full latency sweep with all combinations of custom ops enabled/disabled. When `QuantFP8` is disabled and (at least) one of `RMSNorm`/`SiluMul` is disabled, custom fusion passes can run, so I tried those configs with and without fusion as well.

Speedup of various configurations versus all custom ops enabled is shown below. Note that custom-fp8 is the default on main (fp8 custom op enabled, others disabled). This was all run on a B200 machine.
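The PR ships its own benchmark script (`benchmarks/kernels/bench_per_token_quant_fp8.py`); purely as an illustration of the kind of eager-vs-compiled comparison involved, a minimal stand-alone sketch using only stock PyTorch might look like this (timings are illustrative, not the numbers reported here):

```python
import torch
from torch.utils import benchmark

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max


def per_token_quant(x: torch.Tensor):
    scale = (x.abs().amax(dim=-1, keepdim=True).float() / FP8_MAX).clamp(min=1e-12)
    return (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE), scale


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 4096, dtype=torch.bfloat16, device=device)

compiled = torch.compile(per_token_quant)  # Inductor fuses the elementwise chain
compiled(x)  # warm up / trigger compilation

for name, fn in [("eager", per_token_quant), ("compiled", compiled)]:
    timer = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
    print(name, timer.timeit(100))
```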
Serving sweep for `redhatai/meta-llama-3.1-8B-Instruct-FP8`: TTFT, ITL, and TPOT median (ms) charts (omitted here).
Serving sweep for `redhatai/meta-llama-3.1-8B-Instruct-FP8-dynamic`: TTFT, ITL, and TPOT median (ms) charts (omitted here).
Serving sweeps on H100 for dynamic/static models: TTFT, TPOT, and ITL median (ms) charts (omitted here).
Latency sweep for `redhatai/meta-llama-3.1-8B-Instruct-FP8-dynamic`
We can see that `torch` outperforms all other implementations, including `custom-fp8` (the current default on main).

Speedup Comparison (vs. custom-all): decode (1 input token, 64 output tokens, batch size 16-256)
Speedup Comparison (vs. custom-all): mixe (1024 input tokens, 64 output tokens, batch-size 4-128)
Speedup Comparison (vs. custom-all): prefill (512-2048 input tokens, 1 output token, batch size 1)
lm_eval
Dynamic per-token (CUDA kernel)
Dynamic per-token (torch implementation)
Static per-tensor (CUDA kernel)
Static per-tensor (torch implementation)