[Perf] Vectorize static / dynamic INT8 quant kernels #19233
Conversation
Hello @yewentao256, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello everyone, gemini-code-assist here to provide a summary of this pull request. This PR focuses on performance optimizations for the int8 quantization kernels within vLLM. The primary goal is to improve the throughput and efficiency of models using int8 quantization, addressing issue #18866. The changes involve implementing vectorization techniques in the CUDA kernels responsible for quantization and adding a new benchmark script to measure the performance gains of these kernels.
Highlights
- Kernel Vectorization: The core optimization applies vectorization using vec_n_t (specifically VEC_SIZE = 16) to the CUDA kernels for static and dynamic int8 quantization (static_scaled_int8_quant_kernel, dynamic_scaled_int8_quant_kernel, dynamic_scaled_int8_azp_quant_kernel). This lets the kernels process multiple elements simultaneously, improving data throughput on the GPU (see the sketch after this list).
- New Benchmark Script: A new benchmark script (benchmarks/kernels/bench_int8_gemm.py) has been added. It uses Triton to compare the performance (TFLOP/s) of BF16 GEMM against various int8 GEMM configurations (per-tensor, per-channel, with/without activation quantization time) across different batch sizes.
- Performance Improvements: The PR description includes benchmark results showing notable improvements. End-to-end throughput increased from ~35 req/s to ~40 req/s for the tested model, and kernel benchmarks show increased TFLOP/s for the optimized int8 kernels, particularly at larger batch sizes.
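To make the vectorization idea concrete, here is a minimal sketch of a static per-tensor INT8 quant kernel that loads and stores 16 elements at a time. It is not the actual vLLM kernel: the packed vec16 type, the quant_one helper, and the assumption that hidden_size is a multiple of VEC_SIZE are simplifications for illustration (the real kernels also handle tails and alignment via a shared helper, discussed later in the review thread).

```cuda
#include <cstdint>

constexpr int VEC_SIZE = 16;

// Simplified stand-in for vLLM's packed vector type.
template <typename T>
struct vec16 {
  T val[VEC_SIZE];
};

// Scale, clamp to the int8 range, and round one element.
__device__ __forceinline__ int8_t quant_one(float x, float inv_scale) {
  float v = fminf(fmaxf(x * inv_scale, -128.f), 127.f);
  return static_cast<int8_t>(rintf(v));
}

template <typename scalar_t>
__global__ void static_int8_quant_sketch(const scalar_t* __restrict__ input,
                                         int8_t* __restrict__ out,
                                         const float* __restrict__ scale_ptr,
                                         int hidden_size) {
  const float inv_scale = 1.0f / *scale_ptr;   // pre-compute the reciprocal once
  const int64_t row = int64_t(blockIdx.x) * hidden_size;  // one block per row
  const int num_vec = hidden_size / VEC_SIZE;  // assume no remainder for brevity

  auto const* v_in = reinterpret_cast<const vec16<scalar_t>*>(input + row);
  auto* v_out = reinterpret_cast<vec16<int8_t>*>(out + row);

  for (int i = threadIdx.x; i < num_vec; i += blockDim.x) {
    vec16<scalar_t> src = v_in[i];  // one wide load
    vec16<int8_t> dst;
#pragma unroll
    for (int j = 0; j < VEC_SIZE; ++j) {
      dst.val[j] = quant_one(static_cast<float>(src.val[j]), inv_scale);
    }
    v_out[i] = dst;                 // one wide store
  }
}
```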
Changelog
- benchmarks/kernels/bench_int8_gemm.py
  - Added a new Triton benchmark script to compare BF16 and int8 GEMM kernel performance.
  - Includes benchmarking for different int8 quantization schemes (per-tensor, per-channel) and activation quantization timings.
  - Configured to run benchmarks for specified models and tensor parallel sizes.
- csrc/quantization/compressed_tensors/int8_quant_kernels.cu
  - Included ../vectorization.cuh (line 6).
  - Implemented vectorization (VEC_SIZE = 16) in static_scaled_int8_quant_kernel for improved data processing (lines 111-142).
  - Implemented vectorization (VEC_SIZE = 16) in dynamic_scaled_int8_quant_kernel for the absmax scan and quantization loop (lines 194-251).
  - Added a MinMax struct and MinMaxOp for efficient min/max reduction in the dynamic AZP kernel (lines 254-263).
  - Implemented vectorization (VEC_SIZE = 16) in dynamic_scaled_int8_azp_quant_kernel for the min/max scan and quantization loop (lines 269-348).
Kernels run fast,
Int8 math, a quick blast,
Vector lanes wide,
Performance takes stride,
Latency's shadow is cast.
Code Review
This pull request introduces significant performance optimizations to the int8 quantization kernels, backed by comprehensive benchmark results showing notable throughput improvements without accuracy degradation. The core changes involve vectorizing CUDA kernels, which is a standard and effective approach for performance enhancement on GPUs.
The new benchmark script (bench_int8_gemm.py) is a valuable addition for quantifying these improvements across various configurations.
The CUDA kernel modifications in int8_quant_kernels.cu appear solid, incorporating vectorization, pre-calculation of inverse scales, and optimized reduction strategies (like the combined Min/Max). These changes are well-aligned with the reported performance gains.
One area for potential clarification/improvement is within the benchmark script concerning the setup for "per-channel" weight quantization scenarios. This is detailed in a specific comment.
Overall, great work on these optimizations!
Summary of Findings
- Benchmark Clarity for Per-Channel Quantization: The benchmark setup for 'per-channel' weight quantization modes (e.g., int8-channel-w-token-a) appears to use a single scale factor for quantizing weights, which is then broadcast to a per-channel format for the matrix multiplication. This might not reflect true per-channel quantization, where weights are quantized with distinct scales per channel. Clarification on whether this setup is intentional, or whether it should be adjusted for more realistic per-channel weight quantization benchmarking, would be beneficial.
Merge Readiness
The core kernel optimizations in this PR are well-implemented and show significant performance benefits. There is one medium-severity point regarding the setup of 'per-channel' quantization benchmarks that would be good to clarify or address to ensure the benchmarks accurately reflect the intended scenarios. Once this point is resolved, the PR should be in good shape for merging. I am unable to approve the pull request, so please have others review and approve this code before merging.
Looks reasonable to me.
Would leave this to our kernel experts @chenyang78, @mgoin, @tlrmchlsmth, and @LucasWilkinson to review.
Nice work overall! I'd like the PR title and description to have more context on what exactly the changes are that achieve this "optimization", and what hardware it was tested on, to give context to the performance numbers.
I also would like to see the test cases in vllm/tests/kernels/quantization/test_int8_quant.py expanded to include these vectorization edge cases from test_fp8_quant.py:

HIDDEN_SIZES += list(range(1024, 1033))  # vectorized conversion edge cases
@@ -107,16 +108,37 @@ template <typename scalar_t, typename scale_type>
__global__ void static_scaled_int8_quant_kernel(
    scalar_t const* __restrict__ input, int8_t* __restrict__ out,
    scale_type const* scale_ptr, const int hidden_size) {
  int const VEC_SIZE = 16;
I think it would be best to use constexpr here, in case it matters to the compiler for the #pragma unroll later. At least I did this for fp8, i.e. constexpr size_t VEC_SIZE = 16;
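For instance, a minimal illustration (not the PR's exact code) of why the compile-time constant pairs naturally with the unroll pragma:

```cuda
// The trip count is a compile-time constant, so the pragma can fully unroll.
constexpr int VEC_SIZE = 16;

__device__ void scale16(const float* __restrict__ in,
                        float* __restrict__ out, float s) {
#pragma unroll
  for (int i = 0; i < VEC_SIZE; ++i) {
    out[i] = in[i] * s;
  }
}
```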
Fixed, thanks! Made a common function vectorize_with_alignment for it to avoid this mistake.
// reduce the min and max values across the block in one go
using BlockReduce = cub::BlockReduce<MinMax, 1024>;
__shared__ typename BlockReduce::TempStorage reduce_storage;
MinMax block_min_max =
    BlockReduce(reduce_storage).Reduce(thread_min_max, MinMaxOp());
Nice idea!
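For readers following the excerpt above, the reduction functor it references could look roughly like this (a sketch of the idea, not necessarily the exact code in the PR). Packing min and max into one payload means a single cub::BlockReduce pass replaces two separate block-wide reductions and their shared-memory round trips.

```cuda
#include <cub/cub.cuh>

struct MinMax {
  float min;
  float max;
};

// Binary combiner for cub::BlockReduce: merges two partial results so that
// one block-wide reduction yields both the minimum and the maximum.
struct MinMaxOp {
  __device__ __forceinline__ MinMax operator()(const MinMax& a,
                                               const MinMax& b) const {
    return MinMax{fminf(a.min, b.min), fmaxf(a.max, b.max)};
  }
};
```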
@mgoin Thanks for the review, your insight is really valuable! I went one step further and made a common function (vectorize_with_alignment) for this.
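As a rough idea of what such a shared helper can look like, here is a simplified sketch: a scalar prologue until the input pointer reaches a vector boundary, a vectorized main body, and a scalar tail. The real vectorize_with_alignment in csrc/quantization/vectorization_utils.cuh is more general (for example around output alignment and a separate vector op); the name, signature, and alignment handling below are illustrative assumptions only.

```cuda
#include <cstdint>

template <int VEC_SIZE, typename InT, typename OutT, typename ScaOp>
__device__ void vectorize_with_alignment_sketch(const InT* in, OutT* out,
                                                int len, int tid, int stride,
                                                ScaOp sca_op) {
  struct VecIn  { InT  val[VEC_SIZE]; };
  struct VecOut { OutT val[VEC_SIZE]; };

  // 1. Scalar prologue: elements before the first vector-aligned input
  //    address (assumes `in` is at least aligned to sizeof(InT)).
  const auto addr = reinterpret_cast<uintptr_t>(in);
  const int vec_bytes = VEC_SIZE * static_cast<int>(sizeof(InT));
  const int rem_bytes = static_cast<int>(addr % vec_bytes);
  int prefix = rem_bytes ? (vec_bytes - rem_bytes) / static_cast<int>(sizeof(InT)) : 0;
  prefix = prefix < len ? prefix : len;
  for (int i = tid; i < prefix; i += stride) sca_op(out[i], in[i]);

  // 2. Vectorized main body: wide load, per-lane scalar op, wide store.
  const int num_vec = (len - prefix) / VEC_SIZE;
  auto const* v_in = reinterpret_cast<const VecIn*>(in + prefix);
  auto* v_out = reinterpret_cast<VecOut*>(out + prefix);
  for (int i = tid; i < num_vec; i += stride) {
    VecIn src = v_in[i];
    VecOut dst;
#pragma unroll
    for (int j = 0; j < VEC_SIZE; ++j) sca_op(dst.val[j], src.val[j]);
    v_out[i] = dst;
  }

  // 3. Scalar epilogue for the tail that does not fill a full vector.
  const int tail = prefix + num_vec * VEC_SIZE;
  for (int i = tail + tid; i < len; i += stride) sca_op(out[i], in[i]);
}
```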
Looks good overall, just a few thoughts:
# Dynamic per-token quant for A, static per-tensor quant for B
scale_b = torch.tensor([1.0], device=device, dtype=torch.float32)
b_int8, scale_b_int8, _ = vllm_scaled_int8_quant(b, scale_b)
assert scale_b_int8.numel() == 1
a_int8, scale_a_int8, _ = vllm_scaled_int8_quant(a)
I know this is a benchmark script, but I think this could still be refactored. Perhaps a few functions/objects and a dictionary?
Great idea! I will have another PR optimizing this, because I generally reuse the code from benchmarks/kernels/bench_fp8_gemm.py and we can update them together.
def run_quant():
    return vllm_scaled_mm(a_int8, b_int8, scale_a_int8, scale_b_int8, dtype)
Why doesn't run_quant include scaled_int8_quant here?
Because this is for the noquant branch, where we don't measure the time for activation quantization (for comparison with the other branch).
struct MinMax {
  float min;
  float max;
};
I like this utility, but you could put a lot more of the code in here, I think:
- constructor: initialize to numeric_limits
- operator&= / operator+= (or void reduce): combine a MinMax with another MinMax, or add a value (updates both min and max members).

That way the code that uses it will be much cleaner.
Great idea! Fixed
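For reference, an enriched struct along the lines of the suggestion might look like this (a hedged sketch assuming float accumulators; the merged code may differ in operator choice and naming). With something like it, the per-thread scan can be written as thread_min_max += x and the block combine becomes a one-liner.

```cuda
#include <cuda/std/limits>

struct MinMax {
  float min, max;

  // Start from the identity values so the first update always wins.
  __host__ __device__ MinMax()
      : min(cuda::std::numeric_limits<float>::max()),
        max(cuda::std::numeric_limits<float>::lowest()) {}

  // Fold in one value: updates both members.
  __device__ MinMax& operator+=(float v) {
    min = fminf(min, v);
    max = fmaxf(max, v);
    return *this;
  }

  // Merge with another partial result (what the block reduction needs).
  __device__ MinMax& operator&=(const MinMax& other) {
    min = fminf(min, other.min);
    max = fmaxf(max, other.max);
    return *this;
  }
};
```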
I really like the vectorization util! A few more comments
// 2. vectorize the main part
for (int i = tid; i < num_vec; i += stride) {
  vout_t tmp;
  vec_op(tmp, v_in[i]);
Could this just do the for loop here and call the vec_op? Or maybe vec_op can have a default parameter value DefaultVec<VEC_SIZE>{sca_op} that loops the sca_op?
DefaultVec<VEC_SIZE> looks better to me, fixed.
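Roughly, such a default vector op just applies the scalar op lane by lane; this is an illustrative sketch, and the PR's actual DefaultVec may differ in its template parameters and member names.

```cuda
// Fallback vector op: callers that only provide a scalar op still get the
// wide load/store path, with the scalar op applied to each lane.
template <int VEC_SIZE, typename ScaOp>
struct DefaultVec {
  ScaOp sca_op;

  template <typename VOut, typename VIn>
  __device__ __forceinline__ void operator()(VOut& dst, const VIn& src) const {
#pragma unroll
    for (int i = 0; i < VEC_SIZE; ++i) {
      sca_op(dst.val[i], src.val[i]);
    }
  }
};
```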
// 1. prefill the when it is unsafe to vectorize
for (int i = tid; i < prefix_elems; i += stride) {
  sca_op(out[i], in[i]);
Could you make these return vout_t / OutT instead of passing in a parameter?
Not sure about this: is the functional style (returning the value) preferred, or is the out-parameter style preferred in vLLM kernels?
I prefer the return: it's more functional, looks clearer, I think it's better general C++/CUDA practice, and the compiler is nominally better at optimizing it (in general; here I think everything will be inlined anyway). So I always default to returning.
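A tiny illustration of the two styles under discussion (hypothetical helpers, not code from this PR):

```cuda
// Out-parameter style: the result is written through a reference.
__device__ void quant_out(int8_t& dst, float src, float inv_scale) {
  dst = static_cast<int8_t>(rintf(fminf(fmaxf(src * inv_scale, -128.f), 127.f)));
}

// Functional style: the result is returned by value; usually just as cheap
// once inlined, and the data flow is easier to follow.
__device__ int8_t quant_ret(float src, float inv_scale) {
  return static_cast<int8_t>(rintf(fminf(fmaxf(src * inv_scale, -128.f), 127.f)));
}
```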
I agree with you, and I raised another issue for community discussion of a clearer code standard. If the return format is preferred, we can adjust all of the out-param code in a new PR.
Great work!
Are the changes to remove the int64 index calculation and to replace static_cast<float>() with float() intentional? Otherwise I think we should keep these.
LGTM, great work iterating on this!
@yewentao256 can you please merge with latest main and push to fix some of the failing tests?
Can we keep performance gains off the title of the PR?
Nice work
Sure, updated, thanks!
Could you force merge into main? @houseroad
Motivation

The scaled_int8_quant kernel family still relied on scalar loads/stores, leaving significant throughput untapped. Issue #18866 highlighted vectorization as the next major opportunity. Thanks to @mgoin for the great issue, to @ztang2370 for the good start in #19062, and to @mgoin for the great incremental compilation document (#19109).
What's in this PR

Vectorized the following kernels:
- static_scaled_int8_quant_kernel
- static_scaled_int8_azp_quant_kernel
- dynamic_scaled_int8_quant_kernel
- dynamic_scaled_int8_azp_quant_kernel

Made an abstraction in csrc/quantization/vectorization_utils.cuh so that this function can be reused later (a usage sketch follows below).

Zero-point refactor.
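To make the abstraction concrete, a quant kernel can hand the shared helper a per-element lambda, along the lines of the simplified helper sketched earlier in the review thread (vectorize_with_alignment_sketch is the illustrative name used there). This is a hedged usage sketch; the real call sites and the helper's signature in csrc/quantization/vectorization_utils.cuh may differ.

```cuda
#include <cstdint>

// Hypothetical call site: static per-tensor INT8 quantization of one row,
// expressed as a scalar lambda passed to the vectorization helper sketch.
template <typename scalar_t>
__global__ void static_quant_via_helper_sketch(
    const scalar_t* __restrict__ input, int8_t* __restrict__ out,
    const float* __restrict__ scale_ptr, int hidden_size) {
  const float inv_scale = 1.0f / *scale_ptr;
  const int64_t row = int64_t(blockIdx.x) * hidden_size;  // one block per row

  vectorize_with_alignment_sketch<16>(
      input + row, out + row, hidden_size, threadIdx.x, blockDim.x,
      [inv_scale](int8_t& dst, const scalar_t& src) {
        const float v = static_cast<float>(src) * inv_scale;
        dst = static_cast<int8_t>(rintf(fminf(fmaxf(v, -128.f), 127.f)));
      });
}
```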
Benchmark tooling

Added benchmarks/kernels/bench_int8_gemm.py for side-by-side BF16 vs. INT8 micro-benchmarks.

Test
Tested on H100:
Accuracy validation:
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8,max_model_len=32768 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
End to end throughput:
vllm bench throughput --model RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 --load-format dummy --input-len 1000 --output-len 100 --max-model-len 32768
Kernel flops test with bench_int8_gemm.py:
Before:
After:
Unit test: