
[RFC][API-Unstable] A16W4 on XPU Device #153019

@liangan1

Description


🚀 The feature, motivation and pitch

Motivation

LLM text generation is autoregressive, and the GEMM computation in the decoding stage for the next token is memory bound. Weight-only quantization with A16W4 (16-bit activations, 4-bit weights) has been widely adopted for LLM inference, especially on client GPUs serving a single user. It reduces memory consumption and memory-bandwidth pressure, which speeds up inference.

Plan

We are working on enabling the XPU device in TorchAO. TorchAO provides multiple quantization recipes for A16W4, e.g., RTN, GPTQ, and AWQ. The goal for torch 2.8 is to provide a performant and comprehensive int4 solution based on RTN; enabling AWQ is a stretch goal. RTN can produce reasonable output in generation tasks, but there may be a large accuracy gap on specific datasets and metrics.

Between GPTQ and AWQ, we want to prioritize AWQ at the current stage. On the kernel side, the oneDNN int4 matmul should be reused by RTN/GPTQ/AWQ, so there should be no performance gap between the algorithms. Even with RTN we use group-wise quantization, so the quantization granularity is similar across algorithms; a minimal sketch of group-wise RTN follows.
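For illustration, here is a minimal sketch of group-wise RTN int4 quantization in plain PyTorch. The symmetric scheme, the [-8, 7] code range, and `group_size=128` are illustrative assumptions, not necessarily the exact recipe the oneDNN kernels implement:

```python
import torch

def rtn_quantize_int4(weight: torch.Tensor, group_size: int = 128):
    """Group-wise round-to-nearest (RTN) quantization to signed int4.

    weight: (out_features, in_features), in_features divisible by group_size.
    Returns int4 codes stored in int8, plus one scale per group.
    """
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    # Symmetric per-group scale: map the max magnitude to the int4 limit 7.
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q.reshape(out_f, in_f), scales

def rtn_dequantize(q: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    out_f, in_f = q.shape
    w = q.reshape(out_f, in_f // group_size, group_size).to(scales.dtype)
    return (w * scales).reshape(out_f, in_f)

# Round-trip check on random data.
w = torch.randn(16, 256, dtype=torch.float16)
q, s = rtn_quantize_int4(w)
err = (w - rtn_dequantize(q, s)).abs().max()
print(f"max group-wise quantization error: {err:.4f}")
```

Because the scale is computed per group rather than per tensor, a single outlier only degrades its own group, which is why the granularity matters more for accuracy than the choice of RTN vs. AWQ/GPTQ matters for kernel performance.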

PR List

Status

| Data Type / Algorithm | AWQ | RTN | GPTQ |
| --- | --- | --- | --- |
| A(fp16)W4 | Yes | Yes | Yes |
| A(bf16)W4 | Yes | Yes | Yes |

Release Information

This feature allows users to leverage A16W4 weight-only quantization to run LLM inference on Intel GPUs with TorchAO, reducing memory consumption and boosting inference speed. It supports both BF16 and FP16 activations and lets users choose between RTN (Round-To-Nearest) and AWQ (Activation-aware Weight Quantization) based on the accuracy requirements of their scenario.
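A usage sketch, under the assumption that the existing TorchAO `quantize_` / `int4_weight_only` entry points are the ones enabled for XPU; the toy model and `group_size=128` are placeholders, and the exact config names for the XPU path may differ:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only

# Toy stand-in for an LLM; any module with nn.Linear layers works.
model = nn.Sequential(nn.Linear(4096, 11008), nn.SiLU(), nn.Linear(11008, 4096))
model = model.to(dtype=torch.bfloat16, device="xpu")  # fp16 also supported

# Apply A16W4 weight-only quantization (group-wise RTN) in place;
# group_size=128 is an illustrative choice.
quantize_(model, int4_weight_only(group_size=128))

with torch.inference_mode():
    x = torch.randn(1, 4096, dtype=torch.bfloat16, device="xpu")
    print(model(x).shape)
```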

Alternatives

No response

Additional context

No response

cc @gujinghui @EikanWang @fengyuan14 @guangyey


Labels

module: xpu (Intel XPU related issues) · release-feature-request (feature tracked for PyTorch OSS releases) · triaged (looked at by a team member and prioritized into an appropriate module)
