Add blocked quantization mode for De/QuantizeLinear #5812
Conversation
Codecov Report

| Coverage Diff | main | #5812 | +/- |
|---|---|---|---|
| Coverage | 56.44% | 56.46% | +0.02% |
| Files | 504 | 504 | |
| Lines | 29883 | 29977 | +94 |
| Branches | 4492 | 4505 | +13 |
| Hits | 16866 | 16928 | +62 |
| Misses | 12199 | 12233 | +34 |
| Partials | 818 | 816 | -2 |
Blocked quantization divides input tensors into smaller 1-D blocks that share the scale and zero-point. Scale and zero point should have the same rank as the input tensor.
Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>

- Rephrase quantization definition
- Test failure cases of blocked quantization
- Raise ValueError instead of RuntimeError where applicable
- Regenerate all auto-generated files
- Lint fixes

Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
}
if (node->hasAttribute(kblock_size)) {
  // Reject nodes that actually use blocked quantization (block_size != 0):
  // the target opset has no block_size attribute to convert to.
  if (node->i(kblock_size) != 0) {
    ONNX_ASSERTM(false, "Blocked quantization is not supported for Opset Version %d.", target_version().version())

Check notice
Code scanning / CodeQL
Too many arguments to formatting function
The purpose of this change is to support dynamic input shapes and block sizes that do not divide the input shape without a remainder. The default block_size is 0 and is interpreted as no blocking. Otherwise, the scale and zero-point are replicated (attr) block_size times along the (attr) axis dimension.
Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
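For illustration, here is a small numpy sketch of the replicate-and-trim idea this commit describes (the helper name `tile_block_params` and the shapes are made up for the example, not code from this PR): the per-block scale is repeated block_size times along the blocked axis and then sliced to the actual axis length, so a block size that leaves a remainder block still lines up element-wise.

```python
import numpy as np

def tile_block_params(scale, axis, block_size, dim):
    """Expand per-block parameters to per-element parameters along `axis`."""
    # Repeat each per-block value block_size times along the blocked axis...
    tiled = np.repeat(scale, block_size, axis=axis)
    # ...then trim to the actual axis length, so a block_size that does not
    # divide `dim` evenly (i.e., a shorter last block) is still covered.
    index = [slice(None)] * scale.ndim
    index[axis] = slice(0, dim)
    return tiled[tuple(index)]

# 7 elements in blocks of 3 along axis 0 -> 3 blocks, the last one partial.
scale = np.array([0.5, 1.0, 2.0], dtype=np.float32)
print(tile_block_params(scale, axis=0, block_size=3, dim=7))
# [0.5 0.5 0.5 1.  1.  1.  2. ]
```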
@gramalingam CI pipeline is failing due to #5894 (unrelated to my changes)
@justinchuby can you merge? I don't have permission to merge.
Hello all, I wonder whether there are existing tools that can do blocked quantization with ONNX models?
@fengyuentau Check out TensorRT 10. Currently with partial support (INT4 weight-only quantization).
@fengyuentau, you can also try matmul_4bits_quantizer.py if you are using ORT, and we are adding Q/DQ support.
Thank you @galagam @yufenglee for the information. Speaking of
### Description
Blocked quantization divides input tensors into smaller 1-D blocks that share the scale and zero-point.
Scale and zero point should have the same rank as the input tensor.
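To make the shape contract concrete, here is a minimal numpy sketch of blocked dequantization (an illustrative reference under the stated assumptions, not the implementation added in this PR; the function and variable names are made up): elements whose index along `axis` falls in the same run of block_size consecutive positions share one scale/zero-point entry.

```python
import numpy as np

def dequantize_blocked(x, x_scale, x_zero_point, axis, block_size):
    """Blocked dequantization: elements in the same 1-D block along `axis`
    share one scale/zero-point entry. x_scale and x_zero_point have the same
    rank as x, with ceil(x.shape[axis] / block_size) entries along `axis`."""
    y = np.empty(x.shape, dtype=np.float32)
    for idx in np.ndindex(*x.shape):
        # Map an element index to its block index along the blocked axis.
        bidx = list(idx)
        bidx[axis] = idx[axis] // block_size
        bidx = tuple(bidx)
        y[idx] = (float(x[idx]) - float(x_zero_point[bidx])) * x_scale[bidx]
    return y

# (2, 6) int8 input, blocks of 3 along axis 1 -> scale/zero_point are (2, 2).
x = np.arange(12, dtype=np.int8).reshape(2, 6)
x_scale = np.array([[0.5, 2.0], [1.0, 0.25]], dtype=np.float32)
x_zero_point = np.zeros((2, 2), dtype=np.int8)
print(dequantize_blocked(x, x_scale, x_zero_point, axis=1, block_size=3))
```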
### Motivation and Context
Blocked quantization (sometimes referred to as group quantization) is described in numerous papers. By allowing finer granularity of the quantization parameters, it improves accuracy, even under extreme compression factors.
Blocked quantization is an inherent part of the Microscaling (MX)-compliant data formats. While MX types are not yet adopted by the ONNX standard, adding support for blocked quantization is a first step in this direction.
References:
[OCP Microscaling Formats (MX) Specification v1.0](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)
[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978)
[GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)
[8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861)
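For completeness, a hedged sketch of how a blocked DequantizeLinear node could be expressed once this change is available (it assumes an onnx build whose opset includes the new block_size attribute, i.e. opset 21; the tensor names and shapes are invented for the example):

```python
import onnx
from onnx import TensorProto, helper

# A (4, 8) int8 input dequantized in blocks of 4 along axis 1:
# x_scale / x_zero_point keep the input's rank, with ceil(8 / 4) = 2
# entries along the blocked axis.
node = helper.make_node(
    "DequantizeLinear",
    inputs=["x", "x_scale", "x_zero_point"],
    outputs=["y"],
    axis=1,
    block_size=4,
)

graph = helper.make_graph(
    [node],
    "blocked_dequantize_example",
    inputs=[
        helper.make_tensor_value_info("x", TensorProto.INT8, [4, 8]),
        helper.make_tensor_value_info("x_scale", TensorProto.FLOAT, [4, 2]),
        helper.make_tensor_value_info("x_zero_point", TensorProto.INT8, [4, 2]),
    ],
    outputs=[helper.make_tensor_value_info("y", TensorProto.FLOAT, [4, 8])],
)

model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
onnx.checker.check_model(model)
```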