Add blocked quantization mode for De/QuantizeLinear #5812


Merged
merged 9 commits from blocked-quantization into onnx:main on Feb 2, 2024

Conversation

galagam
Contributor

@galagam galagam commented Dec 18, 2023

Description

Blocked quantization divides input tensors into smaller 1-D blocks that share a scale and zero point.
The scale and zero point must have the same rank as the input tensor.
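
To make the block semantics concrete, here is a minimal NumPy sketch (an illustration only, not the reference implementation in onnx/reference/ops/op_quantize_linear.py; the helper name, int8 target type, and saturation range are assumptions for the example):

```python
import numpy as np

def blocked_quantize(x, scale, zero_point, axis, block_size):
    # scale/zero_point have the same rank as x, with
    # ceil(x.shape[axis] / block_size) entries along `axis`.
    idx = np.arange(x.shape[axis])
    # Replicate each block's parameters block_size times along `axis`, then
    # trim so a block_size that doesn't divide the axis evenly still works.
    s = np.take(np.repeat(scale, block_size, axis=axis), idx, axis=axis)
    zp = np.take(np.repeat(zero_point, block_size, axis=axis), idx, axis=axis)
    # Quantize element-wise and saturate to int8 (illustrative target type).
    return np.clip(np.rint(x / s) + zp, -128, 127).astype(np.int8)

# Example: a (4, 6) tensor quantized in 1-D blocks of 3 along axis 1.
# scale has shape (4, 2): same rank as x, with ceil(6/3) = 2 blocks per row.
x = np.random.randn(4, 6).astype(np.float32)
scale = np.abs(x).reshape(4, 2, 3).max(axis=-1) / 127.0 + 1e-12
zero_point = np.zeros_like(scale, dtype=np.int8)
y = blocked_quantize(x, scale, zero_point, axis=1, block_size=3)
```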

Motivation and Context

Blocked quantization (sometimes referred to as group quantization) is described in numerous papers. By allowing finer granularity of the quantization parameters, accuracy improves even under extreme compression factors.
Blocked quantization is an inherent part of the Microscaling (MX)-compliant data formats. While MX types have not yet been adopted by the ONNX standard, adding support for blocked quantization is a first step in that direction.

References:
- [OCP Microscaling Formats (MX) Specification v1.0](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)
- [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf)
- [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)
- [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861)

@galagam galagam requested review from a team as code owners December 18, 2023 17:18

codecov bot commented Dec 18, 2023

Codecov Report

Attention: 45 lines in your changes are missing coverage. Please review.

Comparison: base (a563b10) at 56.44% vs. head (508d204) at 56.46%.

| Files | Patch % | Lines |
|-------|---------|-------|
| onnx/backend/test/case/node/dequantizelinear.py | 0.00% | 15 Missing ⚠️ |
| onnx/backend/test/case/node/quantizelinear.py | 0.00% | 15 Missing ⚠️ |
| onnx/reference/ops/op_quantize_linear.py | 75.00% | 8 Missing and 2 partials ⚠️ |
| onnx/reference/ops/op_dequantize_linear.py | 66.66% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5812      +/-   ##
==========================================
+ Coverage   56.44%   56.46%   +0.02%     
==========================================
  Files         504      504              
  Lines       29883    29977      +94     
  Branches     4492     4505      +13     
==========================================
+ Hits        16866    16928      +62     
- Misses      12199    12233      +34     
+ Partials      818      816       -2     


@galagam galagam force-pushed the blocked-quantization branch from d3b2cfa to e8e59bc on December 19, 2023 11:54
@justinchuby justinchuby added the topic: operator (Issues related to ONNX operators) label Dec 22, 2023
@galagam galagam force-pushed the blocked-quantization branch from e8e59bc to c28e0d7 on January 2, 2024 15:36
@justinchuby justinchuby added this to the 1.16 milestone Jan 4, 2024
@galagam galagam force-pushed the blocked-quantization branch 2 times, most recently from 4f268f8 to ac4ceed on January 8, 2024 17:08
@galagam galagam force-pushed the blocked-quantization branch from cc5634a to cfda0c1 on January 10, 2024 08:56
Blocked quantization divides input tensors into smaller 1-D blocks
that share the scale and zero-point.
Scale and zero point should have the same rank as the input tensor.

Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
- Rephrase quantization definition
- Test failure cases of blocked quantization
- Raise ValueError instead of RuntimeError where applicable
- Regenerate all auto-generated files
- Lint fixes

Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
@galagam galagam force-pushed the blocked-quantization branch from aab0de5 to f149c8e on January 30, 2024 20:32
}
if (node->hasAttribute(kblock_size)) {
  if ((node->i(kblock_size) != 0)) {
    ONNX_ASSERTM(false, "Blocked quantization is not supported for Opset Version %d.", target_version().version())

Check notice (Code scanning / CodeQL): Too many arguments to formatting function
Format for barf (in a macro expansion) expects 5 arguments but given 6
@galagam galagam force-pushed the blocked-quantization branch from f149c8e to 3526443 on January 30, 2024 20:49
The purpose of this change is to support dynamic input shapes and block sizes that do not evenly divide the input shape.
The default block_size is 0 and is interpreted as no blocking.
Otherwise, the scale and zero point are replicated block_size (attr) times along the axis (attr) dimension (see the sketch below).

Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
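
A minimal NumPy sketch of the repeat-and-trim behavior described in the commit message above (the helper name is hypothetical; only the semantics come from this PR):

```python
import numpy as np

def expand_scale(scale, axis, block_size, input_dim):
    # Hypothetical helper illustrating the replication described above.
    # block_size == 0 is the default and means no blocking: the scale is
    # used as-is (per-tensor or per-axis).
    if block_size == 0:
        return scale
    # Otherwise repeat each block's scale block_size times along `axis` and
    # trim to input_dim, so dynamic shapes and block sizes that don't evenly
    # divide the input still work: the last block is simply shorter.
    repeated = np.repeat(scale, block_size, axis=axis)
    return np.take(repeated, np.arange(input_dim), axis=axis)

# A (2, 3) scale expanded over an axis-1 dimension of 10 with block_size=4:
# blocks cover indices 0-3, 4-7, and 8-9 (the last block is the remainder).
s = np.arange(6, dtype=np.float32).reshape(2, 3)
print(expand_scale(s, axis=1, block_size=4, input_dim=10).shape)  # (2, 10)
```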
@galagam galagam force-pushed the blocked-quantization branch from 3526443 to cdb5f49 on January 30, 2024 20:50
Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
@galagam
Contributor Author

galagam commented Feb 1, 2024

@gramalingam The CI pipeline is failing due to #5894 (unrelated to my changes).
However, if there are no other issues, please approve so we can merge once the CI pipeline is passing.

@galagam
Contributor Author

galagam commented Feb 2, 2024

@justinchuby can you merge? I don't have permission to merge.

@justinchuby justinchuby added this pull request to the merge queue Feb 2, 2024
Merged via the queue into onnx:main with commit d229258 Feb 2, 2024
@fengyuentau

Hello all, I wonder whether there are existing tools that can do blocked quantization with ONNX models?

@galagam
Contributor Author

galagam commented Jun 13, 2024

Hello all, I wonder whether there are existing tools that can do blocked quantization with ONNX models?

@fengyuentau Check out TensorRT 10; it currently has partial support (INT4 weight-only quantization).

@yufenglee
Contributor

@fengyuentau, you can also try matmul_4bits_quantizer.py if you are using ORT. We are also adding Q/DQ support.

@fengyuentau

Thank you @galagam @yufenglee for the information. Speaking of adding Q/DQ support, do you have the link to the pull request?

linshokaku pushed a commit to linshokaku/onnx that referenced this pull request Oct 2, 2024