
[Feature request] Add support for Int4 data-type #5776

@galagam

Description

System information

onnx v1.16.0
main top-of-tree

What is the problem that this feature solves?

LLMs dominate DL research and development today. Recent networks require tens of GBs of memory, and the main inference bottleneck is memory access (capacity and bandwidth). Recent papers show promising results with sub-byte data types, specifically weight-only int4 quantization.
The motivation behind weight-only quantization is improving performance (larger batches improve MMA utilization; reduced memory bandwidth demand benefits memory-bound generation) as well as enabling large models that would otherwise not fit on a single GPU.
By quantizing data to 4 bits, we can both reduce the model size and significantly accelerate memory-bound inference use cases.

Alternatives considered

N/A

Describe the feature

Every two int4 elements (i_0, i_1) will be packed into a single uint8, as follows:
buffer = i_0 << 4 | i_1 & 0x0F

Tensors with an odd number of elements are not supported.
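
A minimal NumPy sketch of this packing scheme, for illustration only; the `pack_int4_pair`/`unpack_int4_pair` names are placeholders, not the actual helper API:

```python
import numpy as np

def pack_int4_pair(i0: int, i1: int) -> int:
    """Pack two signed int4 values (-8..7) into one uint8:
    i_0 in the high nibble, i_1 in the low nibble, per the formula above."""
    return ((i0 & 0x0F) << 4) | (i1 & 0x0F)

def unpack_int4_pair(byte: int) -> tuple[int, int]:
    """Recover the two signed int4 values, sign-extending each nibble."""
    def sign_extend(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble
    return sign_extend((byte >> 4) & 0x0F), sign_extend(byte & 0x0F)

# A flat tensor with an even number of elements packs pairwise:
values = np.array([-8, 7, 3, -1], dtype=np.int8)
packed = np.array(
    [pack_int4_pair(a, b) for a, b in zip(values[0::2], values[1::2])],
    dtype=np.uint8,
)
assert [unpack_int4_pair(b) for b in packed] == [(-8, 7), (3, -1)]
```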

  • Add data-type TensorProto.INT4
  • Add helper functions to pack and unpack int4
  • Add support in QuantizeLinear, DequantizeLinear and some shape ops

Will this influence the current api (Y/N)?

Yes
Additional data type available
Adding int4 to a subset of the operators: QuantizeLinear and DequantizeLinear (optionally Shape, Size, Transpose, Reshape, Constant).
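
For illustration, a hedged NumPy sketch of how DequantizeLinear semantics could look on packed int4 data, assuming a per-tensor scale and zero_point; the function name and layout are hypothetical, not the operator's actual implementation:

```python
import numpy as np

def dequantize_packed_int4(packed: np.ndarray, scale: float, zero_point: int = 0) -> np.ndarray:
    """Unpack each uint8 into two signed 4-bit values (high nibble first,
    matching the packing above), then apply y = (x - zero_point) * scale."""
    high = (packed.astype(np.int16) >> 4) & 0x0F
    low = packed.astype(np.int16) & 0x0F
    nibbles = np.stack([high, low], axis=-1).reshape(-1)
    nibbles = np.where(nibbles >= 8, nibbles - 16, nibbles)  # sign-extend
    return (nibbles - zero_point).astype(np.float32) * scale

packed = np.array([0x8F], dtype=np.uint8)          # holds int4 values -8 and -1
print(dequantize_packed_int4(packed, scale=0.5))   # [-4.  -0.5]
```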

Feature Area

data-types, operators

Are you willing to contribute it (Y/N)

Yes

Notes

Relevant papers using int4 quantization for LLMs:
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
