## TL;DR
This RFC proposes and discusses the upstreaming of Intel GPU support in PyTorch. Our focus is on leveraging Intel's advancements in GPU technology to enhance PyTorch's performance and versatility. This initiative begins with `torch.compile` integration as a primary step and marks a significant stride towards incorporating the Intel GPU as a robust computational backend in PyTorch. The RFC outlines the key components and a high-level design strategy for this integration. By aligning with the PyTorch 2.5 release goals, we aim to provide Intel GPU as a Beta feature to benefit a wide range of users and applications.
## Motivation
Intel GPUs significantly enhance workload performance, showcasing strong processing efficiency. Having obtained promising performance with Intel® Extension for PyTorch (IPEX), we propose upstreaming the features and optimizations staged in IPEX to stock PyTorch. This will provide an out-of-the-box experience on the Intel GPU platform for users and benefit the PyTorch community.
## Approach
Eventually, we will fully support Intel GPU in PyTorch in both `torch.compile` mode and eager mode. From an execution perspective, we will reach this goal gradually, starting with `torch.compile` to align with the PyTorch 2.5 release as a Beta feature. Functional and performance maturity will be driven by the Dynamo benchmarks: HuggingFace, TIMM, and TorchBench. Regarding data types, we will support FP32, TF32, BF16, and FP16 first; other data types such as INT8 and FP8 are out of scope for PyTorch 2.5 and will be supported gradually afterwards.
In addition, we have added a dedicated dispatch key and device name for Intel GPU to PyTorch, which can be found on PyTorch GitHub. The components and features we upstream to stock PyTorch for Intel GPU will be based on the "XPU" device tag.
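The supported low-precision data types can be exercised through autocast. The sketch below is hedged: the `"xpu"` branch assumes the upstreamed `torch.xpu` API mirrors `torch.cuda`, and on hosts without an Intel GPU it falls back to CPU autocast.

```python
import torch

# Hedged sketch: a BF16 matmul under autocast. The "xpu" branch is an
# assumption about the upstreamed API; CPU-only hosts take the "cpu" branch.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    a = torch.randn(4, 4, device=device)
    b = a @ a  # matmul is autocast to bfloat16
```

The same pattern covers FP16 by passing `dtype=torch.float16` where the backend supports it.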
In summary, the scope of the PyTorch 2.5 release for Intel GPU is as follows:
- Beta: `torch.compile` functionality and performance
  - Pass applicable unit tests
  - Data types: FP32, TF32, BF16, and FP16
  - Proven by 3 benchmark suites (HuggingFace + TorchBench + TIMM) at minimum
  - Larger model coverage as a stretch goal
- Intel® Data Center GPU Max Series (single device) and Intel® Client GPU Series
- Linux and Windows
- Pip only, with pre-built packages @ PyTorch Download
- No Libtorch
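The targeted user experience can be sketched as follows. The availability check is hedged so the snippet also runs on CPU-only hosts; the `"xpu"` device string assumes the device tag described above.

```python
import torch

# Sketch: the "xpu" device tag is intended to work like "cuda".
# pick_device is an illustrative helper, not a PyTorch API.
def pick_device() -> torch.device:
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")

device = pick_device()
x = torch.randn(8, 16, device=device)
model = torch.nn.Linear(16, 4).to(device)
y = model(x)
```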
## Components
Since we are taking `torch.compile` as the initial step to align with the PyTorch 2.5 release, we have identified a Minimum Viable Product (MVP) set. It contains five crucial components:

- Intel GPU Runtime – This component is the cornerstone that supports the other features. It provides the device/runtime user interfaces such as `Stream`, `Event`, `Device`, and so on.
- Minimum Set of Necessary ATen Operations – Although we take `torch.compile` as the initial step to support Intel GPU, we still must implement a minimum ATen operation set to cover the following situations:
  - ATen operations that the Inductor backend falls back to, such as `convolution` and `matmul`.
  - ATen operations needed to glue the kernels produced by Inductor, such as `randn`, `empty`, and `as_strided`.
- oneDNN Library Integration – On Intel Xeon platforms, we have relied on oneDNN to deliver optimal performance for `convolution` and `gemm` operations. The same applies to Intel GPU.
- Intel GPU Backend for Inductor – Intel GPU integrates with the `torch.compile` stack at the Inductor level by providing a Triton-based Inductor device backend, making it the crucial component for `torch.compile` support on Intel GPU.
- CI/CD for Intel GPU – The Intel GPU CI/CD customization is the infrastructure and gatekeeper that ensures the quality of all the above components.
Besides the five crucial components above, we will rely on the Intel GPU driver and SYCL to implement the Intel GPU runtime and the necessary native ATen operations.
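The two categories of minimum ATen operations can be illustrated with the calls below, exercised on CPU; once upstreamed, the identical calls would dispatch to the XPU implementations.

```python
import torch

# "Glue" operations that Inductor-generated kernels rely on:
x = torch.empty(4, 4)              # raw allocation, no initialization
r = torch.randn(4, 4)              # random-number kernel
v = r.as_strided((2, 2), (4, 1))   # re-view without copying

# Operations the Inductor backend falls back to ATen/library kernels for:
c = torch.matmul(r, r)             # GEMM, backed by oneDNN
```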
## Design
In this section, we present a high-level design for each component. Regarding the detailed design, please refer to the dedicated RFC for each component for more information.
- **Intel GPU Runtime**

  Basically, PyTorch defines `Device`, `Stream`, `Event`, `Guard`, `Generator`, and `Allocator` abstractions for GPUs. For Intel GPU, we will follow this design and share source code among different GPUs as much as possible. Beyond the common code shared with other GPUs, the Intel GPU runtime component will add SYCL implementations specific to Intel GPU.

  Please refer to the dedicated RFC for detailed design elaboration.
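A hedged sketch of these runtime interfaces from the Python side, assuming the upstreamed `torch.xpu` module mirrors `torch.cuda` (`Stream`, `Event`, `synchronize`). The guard keeps it a no-op on hosts without an Intel GPU.

```python
import torch

# Assumption: torch.xpu exposes Stream/Event/stream analogous to torch.cuda.
available = hasattr(torch, "xpu") and torch.xpu.is_available()
if available:
    stream = torch.xpu.Stream()
    start = torch.xpu.Event(enable_timing=True)
    end = torch.xpu.Event(enable_timing=True)
    with torch.xpu.stream(stream):       # select the stream, Guard-style
        start.record()
        t = torch.ones(1024, device="xpu") * 2
        end.record()
    torch.xpu.synchronize()              # wait for all queued work
```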
- **Minimum Set of Necessary ATen Operations**

  We profiled HuggingFace, TIMM, and TorchBench and collected all the ATen operations that could not be lowered to the C++/OpenMP or Triton backend. These operations include elementwise, reduction, random, concat, scan, and indexing. We will implement these ATen operations in SYCL; before that, we will integrate the SYCL compiler into the PyTorch build system.

  Please refer to the dedicated RFC for detailed design elaboration.
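The profiled operation categories can be exercised eagerly as below (shown on CPU); the SYCL implementations would cover exactly these calls when they cannot be lowered to Triton on XPU.

```python
import torch

x = torch.arange(8.0)
elementwise = x.relu() + 1            # elementwise
reduction = elementwise.sum()         # reduction
cat = torch.cat([x, x])               # concat
scan = torch.cumsum(x, dim=0)         # scan
idx = x[torch.tensor([0, 3, 5])]      # indexing
```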
- **oneDNN Library Integration**

  PyTorch has integrated oneDNN as a git submodule for CPU support. For Intel GPU support, we will reuse the same oneDNN codebase. To minimize the integration effort, we intend to build oneDNN separately as a static library for Intel CPU and Intel GPU, respectively, and then statically link the two libraries into `libtorch_cpu.so` and `libtorch_xpu.so`. This approach avoids directly modifying PyTorch code for oneDNN integration, and it allows us to produce binaries targeted specifically at CPU or GPU hardware while reusing the oneDNN source code.

  Please refer to the dedicated RFC for detailed design elaboration.
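Convolution and GEMM are the operations oneDNN backs. On CPU these already route through the oneDNN (a.k.a. MKL-DNN) build that ships with PyTorch, and the XPU build would link a separately built oneDNN the same way:

```python
import torch

# oneDNN is reported under its former name, MKL-DNN.
print("oneDNN available:", torch.backends.mkldnn.is_available())

conv = torch.nn.Conv2d(3, 8, kernel_size=3)
y = conv(torch.randn(1, 3, 16, 16))         # convolution
g = torch.randn(4, 5) @ torch.randn(5, 6)   # GEMM
```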
- **Intel GPU Backend for Inductor**

  Inductor already has a Triton backend to support GPUs, and we have enabled Triton to support Intel GPUs. This means we can extend Inductor to support Intel GPUs by building on top of the existing Triton backend, requiring only minimal design and code changes in the Inductor codebase itself.

  Please refer to the dedicated RFC for detailed design elaboration.
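Because the Intel GPU path plugs in beneath Inductor via Triton, no new `torch.compile` backend name is introduced; users keep the default backend:

```python
import torch
import torch._dynamo

# Inductor remains the default torch.compile backend; the Intel GPU support
# lives underneath it as a Triton device backend, not as a new entry here.
backends = torch._dynamo.list_backends()
print(backends)
```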
- **CI/CD for Intel GPU**

  To enable CI/CD for Intel GPUs, we will maximize reuse of the existing PyTorch CI/CD infrastructure and mirror the workflows of other hardware, including Docker-based builds, label-based triggers for CI/CD pipelines, and similar patterns. Intel GPU-specific builds and tests will run on self-hosted runners equipped with Intel GPUs.

  Please refer to the dedicated RFC for detailed design elaboration.
For a more comprehensive and detailed understanding of each component's design, we highly encourage you to explore the respective RFCs linked above. These documents provide in-depth insight and technical specifics that are crucial for a complete grasp of the proposed implementations and integrations.
## Tasks
A more detailed task list is WIP.
### Intel GPU Runtime
- [x] oneAPI BaseToolkit Integration
- [x] `Device` for Intel GPU
- [x] `Stream` for Intel GPU
- [x] `Event` for Intel GPU
- [x] `Allocator` for Intel GPU
- [x] `Guard` for Intel GPU
- [x] Random Generator
### Necessary Native Aten Operation Support
- [x] Integrate XPU OPs as a third-party module
- [x] SYCL Compiler Host/Device Separate Compilation
- [x] ATen Operations (Incremental): Elementwise
- [x] ATen Operations (Incremental): Reduction
- [x] ATen Operations (Incremental): Concat, Sort, Arange and Indexing
- [x] Dynamo HuggingFace Benchmark
- [x] Dynamo TIMM Benchmark
- [x] Dynamo TorchBench Benchmark
### OneDNN Library Integration
- [x] oneDNN Library for Intel GPU Integration
- [x] ATen Operations: Conv
- [x] ATen Operations: GEMM
- [ ] ATen Operations: GEMM-Fused Operations
- [ ] ATen Operations: Conv-Fused Operations
### Intel GPU Backend for Inductor
- [x] Python Wrapper Code Generation for Intel GPU
- [x] Intel GPU Backend on Top of Triton for Kernel Code Generation
### CI/CD for Intel GPU
- [x] Self-hosted Runner Hosted in Intel Developer Cloud to Be Available in PyTorch
- [x] AWS-Docker-Based CI/CD Build Task Available for Intel GPU
- [x] CI/CD Test Task Available for Intel GPU
## Additional context
This RFC primarily concentrates on enabling Intel GPU support for `torch.compile`. Additionally, we are evaluating the possibility of extending this support to eager mode through `torch.compile` as well. Please refer to #115545.
cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler