🚀 The feature, motivation and pitch
Motivation
As you know, generation with an LLM is autoregressive, and the GEMM computation in the decoding stage for the next token is memory bound. Weight-only quantization with A16W4 (16-bit activations, 4-bit weights) has been widely adopted for LLM inference, especially on client GPUs serving a single user. It reduces the memory footprint of the weights and thereby speeds up inference.
Plan
We are working on enabling the XPU device in torchAO. TorchAO provides multiple quantization recipes for A16W4, e.g., RTN, GPTQ, and AWQ. The goal for torch-2.8 is to provide a performant int4 solution with RTN; enabling AWQ is a stretch goal. RTN can produce reasonable output in generation tasks, but there may be a large accuracy gap on a specific dataset and metric.
Between GPTQ and AWQ, at the current stage we want to prioritize AWQ. On the kernel side, the int4 matmul with oneDNN should be reused by RTN/GPTQ/AWQ, so there should be no performance gap between the algorithms. Even with RTN we use group-wise quantization, so the quantization granularity is similar across the algorithms.
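The group-wise RTN recipe described above can be sketched in a few lines. This is a minimal pure-Python illustration of the idea (symmetric per-group scales, signed int4 range), not the oneDNN kernel path or torchAO's actual implementation; the function names and the choice of symmetric quantization are assumptions for illustration.

```python
# Sketch of group-wise RTN (round-to-nearest) int4 weight-only quantization.
# Hypothetical helper names; symmetric quantization is assumed for simplicity.

def rtn_int4_quantize(weights, group_size):
    """Quantize a flat list of float weights to signed int4, one scale per group."""
    qweights, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        # Symmetric scale: map the largest magnitude onto the int4 range [-8, 7].
        scale = max(abs(w) for w in group) / 7.0 or 1.0
        qweights.extend(max(-8, min(7, round(w / scale))) for w in group)
        scales.append(scale)
    return qweights, scales

def rtn_int4_dequantize(qweights, scales, group_size):
    """Reconstruct approximate float weights from int4 values and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]

w = [0.10, -0.52, 0.33, 0.70, -1.40, 0.05, 0.88, -0.21]
qw, scales = rtn_int4_quantize(w, group_size=4)
w_hat = rtn_int4_dequantize(qw, scales, group_size=4)
```

Because each group carries its own scale, the per-group granularity here is the same knob that GPTQ and AWQ tune; those algorithms differ in how the quantized values and scales are chosen, not in the storage format the kernel consumes.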
PR List
- [Intel GPU] int4 WOQ gemm XPU Support #137566
- [Intel GPU] OneDNN primitive cache support for Int4 WOQ gemm on XPU #147693
- INT4 XPU enabling in torchAO
  - Enable AWQ+ XPU in torchAO #2248
  - Enable FP16 activation for WOQ in torchAO int4 #2240 (merged)
Status
| Data Type / Algorithm | AWQ | RTN | GPTQ |
|---|---|---|---|
| A(fp16)W4 | Yes | Yes | Yes |
| A(bf16)W4 | Yes | Yes | Yes |
Release Information
This feature allows users to leverage A16W4 weight-only quantization to run LLM inference on Intel GPUs with TorchAO, reducing memory consumption and boosting inference speed. It supports both BF16 and FP16 activations and additionally lets users choose between RTN (Round-to-Nearest) and AWQ (Activation-aware Weight Quantization) based on the accuracy requirements of specific scenarios.
Alternatives
No response
Additional context
No response