Add blocked quantization mode for De/QuantizeLinear #5812
Conversation
Codecov Report

| Coverage Diff | main | #5812 | +/- |
|---|---|---|---|
| Coverage | 56.44% | 56.46% | +0.02% |
| Files | 504 | 504 | |
| Lines | 29883 | 29977 | +94 |
| Branches | 4492 | 4505 | +13 |
| Hits | 16866 | 16928 | +62 |
| Misses | 12199 | 12233 | +34 |
| Partials | 818 | 816 | -2 |
Blocked quantization divides input tensors into smaller 1-D blocks that share the scale and zero-point. Scale and zero point should have the same rank as the input tensor.
Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>

- Rephrase quantization definition
- Test failure cases of blocked quantization
- Raise ValueError instead of RuntimeError where applicable
- Regenerate all auto-generated files
- Lint fixes

Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
}
if (node->hasAttribute(kblock_size)) {
  // Reject nodes that actually use blocked quantization (block_size != 0):
  // the target opset has no block_size attribute to convert to.
  if (node->i(kblock_size) != 0) {
    ONNX_ASSERTM(false, "Blocked quantization is not supported for Opset Version %d.", target_version().version())

Check notice
Code scanning / CodeQL
Too many arguments to formatting function
The purpose of this change is to support dynamic input shapes and block sizes that do not divide the input shape without a remainder. The default block_size is 0 and is interpreted as no blocking. Otherwise, the scale and zero-point are replicated (attr) block_size times along the (attr) axis dimension.
Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
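For illustration, here is a small numpy sketch of the replicate-and-trim idea this commit describes (the helper name `tile_block_params` and the shapes are made up for the example, not code from this PR): the per-block scale is repeated block_size times along the blocked axis and then sliced to the actual axis length, so a block size that leaves a remainder block still lines up element-wise.

```python
import numpy as np

def tile_block_params(scale, axis, block_size, dim):
    """Expand per-block parameters to per-element parameters along `axis`."""
    # Repeat each per-block value block_size times along the blocked axis...
    tiled = np.repeat(scale, block_size, axis=axis)
    # ...then trim to the actual axis length, so a block_size that does not
    # divide `dim` evenly (i.e., a shorter last block) is still covered.
    index = [slice(None)] * scale.ndim
    index[axis] = slice(0, dim)
    return tiled[tuple(index)]

# 7 elements in blocks of 3 along axis 0 -> 3 blocks, the last one partial.
scale = np.array([0.5, 1.0, 2.0], dtype=np.float32)
print(tile_block_params(scale, axis=0, block_size=3, dim=7))
# [0.5 0.5 0.5 1.  1.  1.  2. ]
```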
@gramalingam CI pipeline is failing due to #5894 (unrelated to my changes)
@justinchuby can you merge? I don't have permission to merge.
Hello all, I wonder whether there are existing tools that can do blocked quantization with ONNX models?
@fengyuentau Check out TensorRT 10. Currently with partial support (INT4 weight-only quantization).
@fengyuentau, you can also try matmul_4bits_quantizer.py if you are using ORT, and we are adding Q/DQ support.
Thank you @galagam @yufenglee for the information. Speaking of
### Description
Blocked quantization divides input tensors into smaller 1-D blocks that share the scale and zero-point.
Scale and zero point should have the same rank as the input tensor.
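To make the shape contract concrete, here is a minimal numpy sketch of blocked dequantization (an illustrative reference under the stated assumptions, not the implementation added in this PR; the function and variable names are made up): elements whose index along `axis` falls in the same run of block_size consecutive positions share one scale/zero-point entry.

```python
import numpy as np

def dequantize_blocked(x, x_scale, x_zero_point, axis, block_size):
    """Blocked dequantization: elements in the same 1-D block along `axis`
    share one scale/zero-point entry. x_scale and x_zero_point have the same
    rank as x, with ceil(x.shape[axis] / block_size) entries along `axis`."""
    y = np.empty(x.shape, dtype=np.float32)
    for idx in np.ndindex(*x.shape):
        # Map an element index to its block index along the blocked axis.
        bidx = list(idx)
        bidx[axis] = idx[axis] // block_size
        bidx = tuple(bidx)
        y[idx] = (float(x[idx]) - float(x_zero_point[bidx])) * x_scale[bidx]
    return y

# (2, 6) int8 input, blocks of 3 along axis 1 -> scale/zero_point are (2, 2).
x = np.arange(12, dtype=np.int8).reshape(2, 6)
x_scale = np.array([[0.5, 2.0], [1.0, 0.25]], dtype=np.float32)
x_zero_point = np.zeros((2, 2), dtype=np.int8)
print(dequantize_blocked(x, x_scale, x_zero_point, axis=1, block_size=3))
```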
### Motivation and Context
Blocked quantization (sometimes referred to as group quantization) is described in numerous papers. By allowing finer granularity of the quantization parameters, it improves accuracy, even under extreme compression factors.
Blocked quantization is an inherent part of the Microscaling (MX)-compliant data formats. While MX types are not yet adopted by the ONNX standard, adding support for blocked quantization is a first step in this direction.
References:
[OCP Microscaling Formats (MX) Specification v1.0](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)
[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978)
[GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)
[8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861)
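For completeness, a hedged sketch of how a blocked DequantizeLinear node could be expressed once this change is available (it assumes an onnx build whose opset includes the new block_size attribute, i.e. opset 21; the tensor names and shapes are invented for the example):

```python
import onnx
from onnx import TensorProto, helper

# A (4, 8) int8 input dequantized in blocks of 4 along axis 1:
# x_scale / x_zero_point keep the input's rank, with ceil(8 / 4) = 2
# entries along the blocked axis.
node = helper.make_node(
    "DequantizeLinear",
    inputs=["x", "x_scale", "x_zero_point"],
    outputs=["y"],
    axis=1,
    block_size=4,
)

graph = helper.make_graph(
    [node],
    "blocked_dequantize_example",
    inputs=[
        helper.make_tensor_value_info("x", TensorProto.INT8, [4, 8]),
        helper.make_tensor_value_info("x_scale", TensorProto.FLOAT, [4, 2]),
        helper.make_tensor_value_info("x_zero_point", TensorProto.INT8, [4, 2]),
    ],
    outputs=[helper.make_tensor_value_info("y", TensorProto.FLOAT, [4, 8])],
)

model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
onnx.checker.check_model(model)
```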