## TL;DR
This RFC proposes and discusses the upstreaming of Intel GPU support in PyTorch. Our focus is on leveraging Intel's advancements in GPU technology to enhance PyTorch's performance and versatility. This initiative begins with `torch.compile` integration as a primary step and marks a significant stride towards incorporating the Intel GPU as a robust computational backend in PyTorch. The RFC outlines the key components and a high-level design strategy for this integration. By aligning with the PyTorch 2.5 release goals, we aim to provide Intel GPU as a Beta feature to benefit a wide range of users and applications.
## Motivation
Intel GPUs significantly enhance workload performance, showcasing strong processing efficiency. Having obtained promising performance with Intel® Extension for PyTorch (IPEX), we propose upstreaming the features and optimizations staged in IPEX to stock PyTorch. This will provide an out-of-the-box experience on the Intel GPU platform for users and benefit the PyTorch community.
## Approach
Eventually, we will fully support Intel GPU in PyTorch in both `torch.compile` mode and eager mode. From an execution perspective, we will reach this goal gradually, starting with `torch.compile` to align with the PyTorch 2.5 release as a Beta feature. Functional and performance maturity will be driven by the Dynamo benchmarks: HuggingFace, TIMM, and TorchBench. Regarding data types, we will support FP32, TF32, BF16, and FP16 first; other data types such as INT8 and FP8 are out of scope for PyTorch 2.5 and will be supported gradually afterwards.
In addition, we have added a dedicated dispatch key and device name for Intel GPU to PyTorch, which can be found on PyTorch GitHub. The components and features we upstream to stock PyTorch for Intel GPU will be based on the "XPU" device tag.
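The supported low-precision data types can be exercised through autocast. The sketch below is hedged: the `"xpu"` branch assumes the upstreamed `torch.xpu` API mirrors `torch.cuda`, and on hosts without an Intel GPU it falls back to CPU autocast.

```python
import torch

# Hedged sketch: a BF16 matmul under autocast. The "xpu" branch is an
# assumption about the upstreamed API; CPU-only hosts take the "cpu" branch.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    a = torch.randn(4, 4, device=device)
    b = a @ a  # matmul is autocast to bfloat16
```

The same pattern covers FP16 by passing `dtype=torch.float16` where the backend supports it.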
In summary, the scope of the PyTorch 2.5 release for Intel GPU is as follows:
- Beta: `torch.compile` functionality and performance
  - Pass applicable unit tests
  - Data types: FP32, TF32, BF16, and FP16
  - Proven by 3 benchmark suites (HuggingFace + TorchBench + TIMM) at minimum
  - Larger model coverage as a stretch goal
- Intel® Data Center GPU Max Series (single device) and Intel® Client GPU Series
- Linux and Windows
- Pip only, with pre-built packages @ PyTorch Download
- No Libtorch
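The targeted user experience can be sketched as follows. The availability check is hedged so the snippet also runs on CPU-only hosts; the `"xpu"` device string assumes the device tag described above.

```python
import torch

# Sketch: the "xpu" device tag is intended to work like "cuda".
# pick_device is an illustrative helper, not a PyTorch API.
def pick_device() -> torch.device:
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")

device = pick_device()
x = torch.randn(8, 16, device=device)
model = torch.nn.Linear(16, 4).to(device)
y = model(x)
```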
## Components
Since we are taking `torch.compile` as the initial step to align with the PyTorch 2.5 release, we have identified a Minimum Viable Product (MVP) set. It contains five crucial components:

- Intel GPU Runtime – This component is the cornerstone that supports the other features. It provides the device/runtime user interfaces such as `Stream`, `Event`, `Device`, and so on.
- Minimum Set of Necessary ATen Operations – Although we take `torch.compile` as the initial step to support Intel GPU, we still must implement a minimum ATen operation set to cover the following situations:
  - ATen operations that the Inductor backend falls back to, such as `convolution` and `matmul`.
  - ATen operations needed to glue the kernels produced by Inductor, such as `randn`, `empty`, and `as_strided`.
- oneDNN Library Integration – On Intel Xeon platforms, we have relied on oneDNN to deliver optimal performance for `convolution` and `gemm` operations. The same applies to Intel GPU.
- Intel GPU Backend for Inductor – Intel GPU integrates with the `torch.compile` stack at the Inductor level by providing a Triton-based Inductor device backend, making it the crucial component for `torch.compile` support on Intel GPU.
- CI/CD for Intel GPU – The Intel GPU CI/CD customization is the infrastructure and gatekeeper that ensures the quality of all the above components.
Besides the five crucial components above, we will rely on the Intel GPU driver and SYCL to implement the Intel GPU runtime and the necessary native ATen operations.
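The two categories of minimum ATen operations can be illustrated with the calls below, exercised on CPU; once upstreamed, the identical calls would dispatch to the XPU implementations.

```python
import torch

# "Glue" operations that Inductor-generated kernels rely on:
x = torch.empty(4, 4)              # raw allocation, no initialization
r = torch.randn(4, 4)              # random-number kernel
v = r.as_strided((2, 2), (4, 1))   # re-view without copying

# Operations the Inductor backend falls back to ATen/library kernels for:
c = torch.matmul(r, r)             # GEMM, backed by oneDNN
```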
## Design
In this section, we present a high-level design for each component. Regarding the detailed design, please refer to the dedicated RFC for each component for more information.
- **Intel GPU Runtime**

  Basically, PyTorch defines `Device`, `Stream`, `Event`, `Guard`, `Generator`, and `Allocator` abstractions for GPUs. For Intel GPU, we will follow this design and share source code among different GPUs as much as possible. Beyond the common code shared with other GPUs, the Intel GPU runtime component will add SYCL implementations specific to Intel GPU.

  Please refer to the dedicated RFC for detailed design elaboration.
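A hedged sketch of these runtime interfaces from the Python side, assuming the upstreamed `torch.xpu` module mirrors `torch.cuda` (`Stream`, `Event`, `synchronize`). The guard keeps it a no-op on hosts without an Intel GPU.

```python
import torch

# Assumption: torch.xpu exposes Stream/Event/stream analogous to torch.cuda.
available = hasattr(torch, "xpu") and torch.xpu.is_available()
if available:
    stream = torch.xpu.Stream()
    start = torch.xpu.Event(enable_timing=True)
    end = torch.xpu.Event(enable_timing=True)
    with torch.xpu.stream(stream):       # select the stream, Guard-style
        start.record()
        t = torch.ones(1024, device="xpu") * 2
        end.record()
    torch.xpu.synchronize()              # wait for all queued work
```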
- **Minimum Set of Necessary ATen Operations**

  We profiled HuggingFace, TIMM, and TorchBench and collected all the ATen operations that could not be lowered to the C++/OpenMP or Triton backend. These operations include elementwise, reduction, random, concat, scan, and indexing. We will implement these ATen operations in SYCL; before that, we will integrate the SYCL compiler into the PyTorch build system.

  Please refer to the dedicated RFC for detailed design elaboration.
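The profiled operation categories can be exercised eagerly as below (shown on CPU); the SYCL implementations would cover exactly these calls when they cannot be lowered to Triton on XPU.

```python
import torch

x = torch.arange(8.0)
elementwise = x.relu() + 1            # elementwise
reduction = elementwise.sum()         # reduction
cat = torch.cat([x, x])               # concat
scan = torch.cumsum(x, dim=0)         # scan
idx = x[torch.tensor([0, 3, 5])]      # indexing
```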
- **oneDNN Library Integration**

  PyTorch has integrated oneDNN as a git submodule for CPU support. For Intel GPU support, we will reuse the same oneDNN codebase. To minimize the integration effort, we intend to build oneDNN separately as a static library for Intel CPU and Intel GPU, respectively, and then statically link the two libraries into `libtorch_cpu.so` and `libtorch_xpu.so`. This approach avoids directly modifying PyTorch code for oneDNN integration, and it allows us to produce binaries targeted specifically at CPU or GPU hardware while reusing the oneDNN source code.

  Please refer to the dedicated RFC for detailed design elaboration.
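Convolution and GEMM are the operations oneDNN backs. On CPU these already route through the oneDNN (a.k.a. MKL-DNN) build that ships with PyTorch, and the XPU build would link a separately built oneDNN the same way:

```python
import torch

# oneDNN is reported under its former name, MKL-DNN.
print("oneDNN available:", torch.backends.mkldnn.is_available())

conv = torch.nn.Conv2d(3, 8, kernel_size=3)
y = conv(torch.randn(1, 3, 16, 16))         # convolution
g = torch.randn(4, 5) @ torch.randn(5, 6)   # GEMM
```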
- **Intel GPU Backend for Inductor**

  Inductor already has a Triton backend to support GPUs, and we have enabled Triton to support Intel GPUs. This means we can extend Inductor to support Intel GPUs by building on top of the existing Triton backend, requiring only minimal design and code changes in the Inductor codebase itself.

  Please refer to the dedicated RFC for detailed design elaboration.
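Because the Intel GPU path plugs in beneath Inductor via Triton, no new `torch.compile` backend name is introduced; users keep the default backend:

```python
import torch
import torch._dynamo

# Inductor remains the default torch.compile backend; the Intel GPU support
# lives underneath it as a Triton device backend, not as a new entry here.
backends = torch._dynamo.list_backends()
print(backends)
```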
- **CI/CD for Intel GPU**

  To enable CI/CD for Intel GPUs, we will maximize reuse of the existing PyTorch CI/CD infrastructure and mirror the workflows of other hardware, including Docker-based builds, label-based triggers for CI/CD pipelines, and similar patterns. Intel GPU-specific builds and tests will run on self-hosted runners equipped with Intel GPUs.

  Please refer to the dedicated RFC for detailed design elaboration.
For a more comprehensive and detailed understanding of each component's design, we highly encourage you to explore the respective RFCs linked above. These documents provide in-depth insight and technical specifics that are crucial for a complete grasp of the proposed implementations and integrations.
## Tasks
A more detailed task list is WIP.
### Intel GPU Runtime
- [x] oneAPI BaseToolkit Integration
- [x] `Device` for Intel GPU
- [x] `Stream` for Intel GPU
- [x] `Event` for Intel GPU
- [x] `Allocator` for Intel GPU
- [x] `Guard` for Intel GPU
- [x] Random Generator
### Necessary Native Aten Operation Support
- [x] Integrate XPU OPs as a third-party module
- [x] SYCL Compiler Host/Device Separate Compilation
- [x] ATen Operations (Incremental): Elementwise
- [x] ATen Operations (Incremental): Reduction
- [x] ATen Operations (Incremental): Concat, Sort, Arange and Indexing
- [x] Dynamo HuggingFace Benchmark
- [x] Dynamo TIMM Benchmark
- [x] Dynamo TorchBench Benchmark
### OneDNN Library Integration
- [x] oneDNN Library for Intel GPU Integration
- [x] ATen Operations: Conv
- [x] ATen Operations: GEMM
- [ ] ATen Operations: GEMM-Fused Operations
- [ ] ATen Operations: Conv-Fused Operations
### Intel GPU Backend for Inductor
- [x] Python Wrapper Code Generation for Intel GPU
- [x] Intel GPU Backend on Top of Triton for Kernel Code Generation
### CI/CD for Intel GPU
- [x] Self-hosted Runner Hosted in Intel Developer Cloud to Be Available in PyTorch
- [x] AWS-Docker-Based CI/CD Build Task Available for Intel GPU
- [x] CI/CD Test Task Available for Intel GPU
## Additional context
This RFC primarily concentrates on enabling Intel GPU support for `torch.compile`. Additionally, we are evaluating the possibility of extending this support to eager mode through `torch.compile` as well. Please refer to #115545.
cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler