Description
System information
onnx v1.16.0
main top-of-tree
What is the problem that this feature solves?
LLMs dominate DL research and development today. Recent networks require tens of GBs of weights, and the main inference bottleneck is memory access (both capacity and bandwidth). Recent papers show promising results with sub-byte data types, specifically weight-only quantization to int4.
The motivation behind weight-only quantization is improved performance (larger batches improve MMA utilization, and the reduced memory traffic benefits memory-bound generation) as well as enabling large models that otherwise could not execute on a single GPU.
By quantizing data to 4 bits, we can both reduce the model size and accelerate memory-bound inference use cases significantly.
Alternatives considered
N/A
Describe the feature
Every two int4 elements (i_0, i_1) will be packed into a single uint8, as follows:
buffer = (i_0 << 4) | (i_1 & 0x0F)
Tensors with an odd number of elements are not supported.
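For illustration, here is a minimal NumPy sketch of what such pack/unpack helpers could look like; the function names and the NumPy-based approach are assumptions for this issue, not the proposed ONNX API.

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack a flat array with an even number of int4 values (range [-8, 7])
    into a uint8 buffer: buffer = (i_0 << 4) | (i_1 & 0x0F)."""
    assert values.size % 2 == 0, "tensors with an odd number of elements are not supported"
    v = values.astype(np.uint8) & 0x0F      # keep only the low 4 bits of each element
    i_0, i_1 = v[0::2], v[1::2]             # split into even/odd positions
    return (i_0 << 4) | i_1                 # high nibble = i_0, low nibble = i_1

def unpack_int4(buffer: np.ndarray) -> np.ndarray:
    """Unpack a uint8 buffer back into signed int4 values (sign-extended to int8)."""
    i_0 = (buffer >> 4) & 0x0F
    i_1 = buffer & 0x0F
    nibbles = np.empty(buffer.size * 2, dtype=np.uint8)
    nibbles[0::2], nibbles[1::2] = i_0, i_1
    # sign-extend: nibble values >= 8 represent negative numbers in two's complement
    return (nibbles.astype(np.int8) ^ 0x08) - 0x08

w = np.array([-8, 7, 3, -1], dtype=np.int8)
packed = pack_int4(w)                       # 2 bytes instead of 4
assert np.array_equal(unpack_int4(packed), w)
```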
- Add data-type TensorProto.INT4
- Add helper functions to pack and unpack int4
- Add support in QuantizeLinear, DequantizeLinear and some shape ops
Will this influence the current api (Y/N)?
Yes
An additional data type becomes available, and int4 is added to a subset of operators, including QuantizeLinear and DequantizeLinear (optionally: Shape, Size, Transpose, Reshape, Constant).
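As a rough sketch of the weight-only dequantization math these operators would apply (y = (x - zero_point) * scale, here with one scale/zero-point per output channel), assuming already-unpacked int4 values; function and variable names are illustrative, not the proposed operator API.

```python
import numpy as np

def dequantize_linear_int4(x_int4: np.ndarray, scale: np.ndarray,
                           zero_point: np.ndarray) -> np.ndarray:
    """DequantizeLinear-style math, y = (x - zero_point) * scale,
    with one scale/zero-point per output channel (axis 0).
    x_int4 holds already-unpacked int4 values in [-8, 7]."""
    x = x_int4.astype(np.float32)
    return (x - zero_point[:, None].astype(np.float32)) * scale[:, None]

# example: a 2x4 int4 weight matrix with per-channel scales
w_q = np.array([[-8, 7, 3, -1],
                [ 0, 2, -4, 5]], dtype=np.int8)
scales = np.array([0.05, 0.1], dtype=np.float32)
zero_points = np.zeros(2, dtype=np.int8)
w_fp = dequantize_linear_int4(w_q, scales, zero_points)  # float32 weights for the MatMul
```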
Feature Area
data-types, operators
Are you willing to contribute it (Y/N)
Yes
Notes
Relevant papers using int4 quantization for LLMs:
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers