🚀 The feature, motivation and pitch
Motivation
As you know, generation with an LLM is autoregressive, and the GEMM computation in the decoding stage for the next token is memory bound. Weight-only quantization with A16W4 (16-bit activations, 4-bit weights) has been widely adopted for LLM inference, especially on client GPUs serving a single user. It reduces the memory footprint of the weights and thereby speeds up inference.
Plan
We are working on enabling the XPU device in torchAO. TorchAO provides multiple quantization recipes for A16W4, e.g., RTN, GPTQ, and AWQ. The goal for torch-2.8 is to provide a performant int4 solution with RTN; enabling AWQ is a stretch goal. RTN can produce reasonable output in generation tasks, but there may be a large accuracy gap on a specific dataset and metric.
Between GPTQ and AWQ, at the current stage we want to prioritize AWQ. On the kernel side, the int4 matmul with oneDNN should be reused by RTN/GPTQ/AWQ, so there should be no performance gap between the algorithms. Even with RTN we use group-wise quantization, so the quantization granularity is similar across the algorithms.
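The group-wise RTN recipe described above can be sketched in a few lines. This is a minimal pure-Python illustration of the idea (symmetric per-group scales, signed int4 range), not the oneDNN kernel path or torchAO's actual implementation; the function names and the choice of symmetric quantization are assumptions for illustration.

```python
# Sketch of group-wise RTN (round-to-nearest) int4 weight-only quantization.
# Hypothetical helper names; symmetric quantization is assumed for simplicity.

def rtn_int4_quantize(weights, group_size):
    """Quantize a flat list of float weights to signed int4, one scale per group."""
    qweights, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        # Symmetric scale: map the largest magnitude onto the int4 range [-8, 7].
        scale = max(abs(w) for w in group) / 7.0 or 1.0
        qweights.extend(max(-8, min(7, round(w / scale))) for w in group)
        scales.append(scale)
    return qweights, scales

def rtn_int4_dequantize(qweights, scales, group_size):
    """Reconstruct approximate float weights from int4 values and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]

w = [0.10, -0.52, 0.33, 0.70, -1.40, 0.05, 0.88, -0.21]
qw, scales = rtn_int4_quantize(w, group_size=4)
w_hat = rtn_int4_dequantize(qw, scales, group_size=4)
```

Because each group carries its own scale, the per-group granularity here is the same knob that GPTQ and AWQ tune; those algorithms differ in how the quantized values and scales are chosen, not in the storage format the kernel consumes.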
PR List
- [Intel GPU] int4 WOQ gemm XPU Support #137566
- [Intel GPU] OneDNN primitive cache support for Int4 WOQ gemm on XPU #147693
- INT4 XPU enabling in torchAO
  - Enable AWQ+ XPU in torchAO #2248
  - Enable FP16 activation for WOQ in torchAO int4 #2240 (merged)
Status
| Data Type / Algorithm | AWQ | RTN | GPTQ |
|---|---|---|---|
| A(fp16)W4 | Yes | Yes | Yes |
| A(bf16)W4 | Yes | Yes | Yes |
Release Information
This feature allows users to leverage A16W4 weight-only quantization to run LLM inference on Intel GPUs with TorchAO, reducing memory consumption and boosting inference speed. It supports both BF16 and FP16 activations and additionally lets users choose between RTN (Round-to-Nearest) and AWQ (Activation-aware Weight Quantization) based on the accuracy requirements of specific scenarios.
Alternatives
No response
Additional context
No response