
guide for pip-installing cuda/cudnn to be compatible with triton-windows #43

@BBC-Esq

(updated 3/11/2025)

Most of the CUDA-related libraries can be installed using pip instead of relying on a system-wide installation. Doing this has two primary benefits:

  1. Allows you to control the CUDA release version used, so you don't have to worry about users installing an incompatible version system-wide; and
  2. Allows users to keep whatever system-wide CUDA installation other programs might require - basically, they don't have to install/re-install different CUDA versions for each program they want to use.

1) CUDA

Technically, "CUDA" or "CUDA toolkit" refers to multiple "sub-libraries" (if you will) and each can be pip-installed.

For example:

pip install nvidia-cuda-runtime-cu12==12.4.127
pip install nvidia-cublas-cu12==12.4.5.8
pip install nvidia-cuda-nvrtc-cu12==12.4.127
pip install nvidia-cuda-nvcc-cu12==12.4.131
pip install nvidia-cufft-cu12==11.2.1.3

NOTE: Notice how each library has its own version. All of the versions above come from CUDA "release" 12.4.1. It's essential to understand that when Nvidia issues a "release" it uses a version such as "12.4.1", but to pip install that release you must pin the specific library versions (not the Nvidia CUDA "release" version) for compatibility reasons. For example, if you want to pip install all the libraries for CUDA "release" 12.6.3, you must find the specific library version for each library your program depends on. You can do this by reviewing the .json files here for the particular CUDA "release" you're interested in.

The versions of the individual libraries never match the CUDA "release" version, and frequently do not match between themselves. This is because each library only gets a new version when Nvidia actually updates it, and a given library may not change between CUDA "releases." Thus, it's possible you'll see the same version number for a library across different CUDA "releases." The key takeaway is that when pip-installing CUDA, for compatibility reasons, you generally want individual library versions that match the versions used in a particular CUDA "release."
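If it helps, here is a minimal sketch (standard library only) for listing which Nvidia library versions are actually installed in your environment, so you can compare them against the .json files for the CUDA "release" you're targeting. Adjust the package names to whatever your program actually installs:

def list_installed_cuda_libs():
    # Print the versions of the pip-installed Nvidia libraries in the current environment.
    from importlib.metadata import version, PackageNotFoundError

    packages = [
        "nvidia-cuda-runtime-cu12",
        "nvidia-cublas-cu12",
        "nvidia-cuda-nvrtc-cu12",
        "nvidia-cuda-nvcc-cu12",
        "nvidia-cufft-cu12",
        "nvidia-cudnn-cu12",
    ]

    for name in packages:
        try:
            print(f"{name}: {version(name)}")
        except PackageNotFoundError:
            print(f"{name}: not installed")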

2) cuDNN

cuDNN provides neural-network primitives like convolution, pooling, normalization, and others. It's not technically part of "CUDA," but most machine learning stuff requires it. It can also be pip-installed; for example:

pip install nvidia-cudnn-cu12==9.1.0.70

Nvidia promises that cuDNN 9 is forwards and backwards compatible with all CUDA 12.x releases. HOWEVER, you still need to double-check whether your program (or its other dependencies) is compatible with the specific version of cuDNN that you pip install. For example, for a significant amount of time the ctranslate2 library was NOT compatible with cuDNN 9+ despite being compatible with CUDA 12+. Fortunately, this has been resolved and is no longer an issue, but it is something to be aware of...just because Nvidia promises compatibility doesn't mean that all other libraries compatible with a particular version of CUDA will automatically be compatible with a particular version of cuDNN.

A good example of this is the well-known torch library (discussed further below). It's only "tested" with a particular version of cuDNN and considers any other version "experimental". See the "compatibility matrix" further below.

If you want to test different versions of cuDNN for whatever reason, you can go here or check out here to get the installables for your platform and desired version.

3) Triton

In simplistic terms, triton generates custom, more efficient code that Nvidia GPUs can run. As of 3.2.0 post11, triton for Windows can now be installed from PyPI:

pip install triton-windows

In previous versions, you would install it by using the link to the specific wheel on the repository's home page; for example:

pip install https://github.com/woct0rdho/triton-windows/releases/download/v3.1.0-windows.post8/triton-3.1.0-cp311-cp311-win_amd64.whl

4) How to use the pip-installed CUDA/cuDNN libraries?

Triton for Windows (as of post12) now bundles the necessary CUDA libraries and other files internally. You can see this in windows_utils.py, which sets the search priority - e.g. first CUDA_PATH, then the files bundled within Triton, and so on.

However, if for whatever reason you do not want this default behavior, you can still rely on the specific versions of the CUDA libraries that you pip-installed by temporarily setting CUDA_PATH, using a function like the set_cuda_paths function below. This function must be run within the entry-point script for your program:

Change this depending on which CUDA-related libraries your program ACTUALLY relies on:

def set_cuda_paths():
    import sys
    import os
    from pathlib import Path

    # Locate the "nvidia" folder inside the active venv's site-packages
    venv_base = Path(sys.executable).parent.parent
    nvidia_base_path = venv_base / 'Lib' / 'site-packages' / 'nvidia'
    cuda_path_runtime = nvidia_base_path / 'cuda_runtime' / 'bin'
    cuda_path_runtime_lib = nvidia_base_path / 'cuda_runtime' / 'lib' / 'x64'
    cuda_path_runtime_include = nvidia_base_path / 'cuda_runtime' / 'include'
    cublas_path = nvidia_base_path / 'cublas' / 'bin'
    cudnn_path = nvidia_base_path / 'cudnn' / 'bin'
    nvrtc_path = nvidia_base_path / 'cuda_nvrtc' / 'bin'
    nvcc_path = nvidia_base_path / 'cuda_nvcc' / 'bin'

    paths_to_add = [
        str(cuda_path_runtime),
        str(cuda_path_runtime_lib),
        str(cuda_path_runtime_include),
        str(cublas_path),
        str(cudnn_path),
        str(nvrtc_path),
        str(nvcc_path),
    ]

    # Prepend the pip-installed locations to PATH (keeping whatever was already there)
    current_value = os.environ.get('PATH', '')
    new_value = os.pathsep.join(paths_to_add + [current_value] if current_value else paths_to_add)
    os.environ['PATH'] = new_value

    # Point CUDA_PATH (used by triton) at the pip-installed cuda_runtime package
    triton_cuda_path = nvidia_base_path / 'cuda_runtime'
    os.environ['CUDA_PATH'] = str(triton_cuda_path)

This TEMPORARILY prepends the pip-installed CUDA (and cuDNN) locations to PATH and points CUDA_PATH at them. Prepending PATH is necessary for libraries other than triton that your program might use (e.g. transformers), and setting CUDA_PATH is necessary for triton because it primarily relies upon CUDA_PATH.

NOTE: This works when using venv and pip-installing the libraries and NOT necessarily when using conda, etc., which manage virtual environments differently.

The set_cuda_paths function essentially forces your program to FIRST look for the necessary libraries where you pip-installed them, but WILL NOT otherwise remove the other paths that other programs may rely on. Moreover, this is temporary...so when you close your program the paths remain the same as they were.

It is crucial to remember that if your program creates a new process/subprocess you must either pass these environment variables or, better yet, call the "set_cuda_paths" function again within the new process/subprocess.
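As a rough sketch of that pattern (assuming set_cuda_paths lives in a module of your own - here hypothetically named cuda_paths.py - and is called before any CUDA-dependent imports):

# Hypothetical entry-point script; "cuda_paths" is whichever module of yours holds set_cuda_paths.
from cuda_paths import set_cuda_paths

set_cuda_paths()  # run BEFORE importing torch, transformers, etc.

import multiprocessing


def worker(item):
    # As recommended above, re-apply the paths inside each new process before
    # any CUDA-dependent imports happen there.
    set_cuda_paths()
    import torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        print(pool.map(worker, range(2)))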

IMPORTANT: If you plan on installing any version of triton other than post12, check here for more specific details and adjust your code accordingly.

5) ptxas.exe

Triton requires ptxas.exe. As of triton post12, this file is bundled and the path is properly set. The bundled version originates from a specific version of CUDA and is fine for most use cases. However, you can still control which version of ptxas.exe triton uses, as discussed in step 7 below.

6) lib folder

triton also requires a particular lib folder, which, for unexplained reasons, is NOT included when pip-installing the CUDA libraries. As of triton post12, however, this folder is also bundled and the path is correctly set. For advanced use cases you can still use a lib folder that originates from a different CUDA release, as discussed in step 7 below.

7) using a specific CUDA version of ptxas.exe and lib folder

As of triton post12, both ptxas.exe and the necessary lib folder are included and the paths to them are properly set. Thus, even if you use something like the set_cuda_paths function to control where your program looks for CUDA-related files, triton will fall back to its bundled versions of these files. However, for advanced development purposes you can still override them using the following instructions:

  1. Regarding ptxas.exe, when you pip install nvidia-cuda-nvcc-cu12, ptxas.exe will be in the \Lib\site-packages\nvidia\cuda_nvcc\bin\ directory. However, triton looks for it in the \Lib\site-packages\nvidia\cuda_runtime\bin\ directory. Therefore, you simply need to copy it manually or have a Python function do it (see the sketch after this list).

IMPORTANT: I say "copy" because it is crucial that you do not "move" the file. Remember, libraries other than triton will still look for it in the standard directory.

  2. Regarding the lib folder, for unexplained reasons, it is NOT included when pip-installing the CUDA libraries. To get the lib folder for a particular CUDA release you can use this program I created...select the CUDA release you want...then select "CUDA Runtime (cudart)". The relevant lib folder is inside the .zip file that is downloaded. Then copy this folder manually (or with a Python function) to the \Lib\site-packages\nvidia\cuda_runtime\ directory within your virtual environment.
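As promised, here is a rough sketch of both copy steps (it assumes the same venv layout used by the set_cuda_paths function above; the downloaded_lib_folder argument and example path are purely illustrative):

def copy_ptxas_and_lib(downloaded_lib_folder):
    # Rough sketch of the two copy steps described above; assumes a standard venv layout.
    import shutil
    import sys
    from pathlib import Path

    venv_base = Path(sys.executable).parent.parent
    nvidia_base = venv_base / 'Lib' / 'site-packages' / 'nvidia'

    # Step 1: COPY (not move) ptxas.exe from cuda_nvcc\bin into cuda_runtime\bin
    src_ptxas = nvidia_base / 'cuda_nvcc' / 'bin' / 'ptxas.exe'
    dst_bin = nvidia_base / 'cuda_runtime' / 'bin'
    dst_bin.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_ptxas, dst_bin / 'ptxas.exe')

    # Step 2: copy the downloaded lib folder into cuda_runtime\lib
    dst_lib = nvidia_base / 'cuda_runtime' / 'lib'
    shutil.copytree(downloaded_lib_folder, dst_lib, dirs_exist_ok=True)

# Example (illustrative path only):
# copy_ptxas_and_lib(r"C:\Downloads\cuda_runtime\lib")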

Overall, this is truly an advanced use case, is no longer required from triton post12 onwards, and should be used with caution. The bundled lib folder and ptxas.exe are tried and tested with the specific version of triton you install. Overriding them is really only necessary when, for example, another part of your program requires a specific CUDA version that's different from what triton bundles and you want to ensure that all libraries use the same versions, perhaps for testing purposes.

8) What if I update triton in between versions of my program?

Hypothetically, let's say that Version 1 of your program uses triton post12 but Version 2 uses a different version of triton. In that case, it will be necessary to delete certain folders containing "compiled code" that the former triton version created. The folders to delete are located at:

C:\Users\<your username>\.triton\cache\
C:\Users\<your username>\AppData\Local\Temp\torchinductor_<your username>\

Explanation: The first time your program uses triton (e.g. via transformers) it compiles certain code and caches it for subsequent runs of your program. This code is NOT deleted when you delete a virtual environment and will persist even if you pip install a different version of triton. Therefore, anytime you update the version of triton, the above two folders must be deleted. You can instruct users to do this manually, but the better approach is to include a function in your installation script that does it automatically; for example:

def clean_triton_cache():
    """Remove Triton cache to ensure clean compilation with current CUDA paths."""
    import shutil
    from pathlib import Path

    triton_cache_dir = Path.home() / '.triton' / 'cache'

    if triton_cache_dir.exists():
        try:
            print(f"\nRemoving Triton cache at {triton_cache_dir}...")
            shutil.rmtree(triton_cache_dir)
            print("\033[92mTriton cache successfully removed.\033[0m")
            return True
        except Exception as e:
            print(f"\033[91mWarning: Failed to remove Triton cache: {e}\033[0m")
            return False
    else:
        print("\nNo Triton cache found to clean.")
        return True

If this is not done, your program will attempt to use the previously-compiled code even though a new version of triton is installed. This only needs to happen in an installation script when a new version of your program relies on a different version of triton.
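The example above only clears the .triton cache; here is a companion sketch for the second (torchinductor) folder. It assumes the default Windows temp location shown above, which is resolved at runtime because the folder name includes your username:

def clean_torchinductor_cache():
    """Remove the torchinductor cache folder mentioned above (default Windows temp location assumed)."""
    import getpass
    import shutil
    import tempfile
    from pathlib import Path

    cache_dir = Path(tempfile.gettempdir()) / f"torchinductor_{getpass.getuser()}"

    if cache_dir.exists():
        try:
            shutil.rmtree(cache_dir)
            print(f"Removed {cache_dir}")
            return True
        except Exception as e:
            print(f"Warning: failed to remove {cache_dir}: {e}")
            return False
    print("No torchinductor cache found to clean.")
    return True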

9) What about other libraries?

Since Triton is basically never used alone, here is some useful compatibility information to help save time and grief.

torch & CUDA compatibility

The most recent torch wheels are described as either cu124 or cu126. These monikers refer to the CUDA "release" that torch has been tested with.

+------------+------------------------------------+
| Wheel Name | Torch Versions Supported           |
+------------+------------------------------------+
| cu126      | 2.6.0                              |
| cu124      | 2.6.0, 2.5.1, 2.5.0, 2.4.1, 2.4.0  |
+------------+------------------------------------+

Unfortunately, however, the monikers cu124 and cu126 don't tell you the specific release: is it 12.4.0 or 12.4.1, for example? To answer this, you have to examine PyTorch's build matrix, which shows the specific CUDA library versions that torch is tested with:

Current as of 3/9/2025
+--------------+------------+------------+------------+
|              |   cu124    |   cu126    |   cu128    | * cu128 not officially released yet
+--------------+------------+------------+------------+
| cuda-nvrtc   | 12.4.127   | 12.6.77    | 12.8.61    |
| cuda-runtime | 12.4.127   | 12.6.77    | 12.8.57    |
| cuda-cupti   | 12.4.127   | 12.6.80    | 12.8.57    |
| cudnn        | 9.1.0.70   | 9.5.1.17   | 9.7.1.26   |
| cublas       | 12.4.5.8   | 12.6.4.1   | 12.8.3.14  |
| cufft        | 11.2.1.3   | 11.3.0.4   | 11.3.3.41  |
| curand       | 10.3.5.147 | 10.3.7.77  | 10.3.9.55  |
| cusolver     | 11.6.1.9   | 11.7.1.2   | 11.7.2.55  |
| cusparse     | 12.3.1.170 | 12.5.4.2   | 12.5.7.53  |
| cusparselt   | 0.6.2      | 0.6.3      | 0.6.3      |
| nccl         | 2.25.1     | 2.25.1     | 2.25.1     |
| nvtx         | 12.4.127   | 12.6.77    | 12.8.55    |
| nvjitlink    | 12.4.127   | 12.6.85    | 12.8.61    |
| cufile       | -          | 1.11.1.6   | 1.13.0.11  |
+--------------+------------+------------+------------+

Additionally, you then have to examine all of the .json files mentioned in step number one (1) above. After doing this, it can be determined that cu124 refers to 12.4.1, cu126 refers to 12.6.3 (with one notable exception), and the soon-to-be-released cu128 refers to 12.8.0. The sole exception regarding cu126 is its usage of cuda-runtime version 12.6.77, which stems from CUDA 12.6.2. All other libraries within cu126 come from CUDA 12.6.3.

Overall, torch is only tested with these specific versions of the CUDA libraries. I have personally encountered errors when, for example, installing version 12.4.0 instead of 12.4.1, and this holds true for torch as well as other libraries like flash attention 2, xformers, etc. (discussed further below). Basically, you need to fully understand which CUDA "release" the specific library versions you pip install originate from, which then allows you to determine whether you're using a compatible version.
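As a quick sanity check, a CUDA-enabled torch build will report the CUDA release and cuDNN build it was compiled against (keep in mind torch.version.cuda only reports major.minor, e.g. "12.4", not the patch level):

# Quick sanity check of what your installed torch wheel was built against
import torch

print(torch.__version__)               # e.g. 2.6.0+cu124
print(torch.version.cuda)              # CUDA release torch was built with, e.g. 12.4
print(torch.backends.cudnn.version())  # cuDNN build number, e.g. 90100 for 9.1.0
print(torch.cuda.is_available())       # whether the runtime actually finds your GPU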

torch and triton compatibility

All torch wheels come with a METADATA (capitalization intended) file that shows which version of triton they are compatible with (see the snippet after the table below). By examining all permutations of the recent torch wheels, you get the following table:

+--------+-------+--------+--------+
| Torch  | CUDA  | Python | Triton |
+--------+-------+--------+--------+
| 2.6.0  | cu126 |  3.13  |  3.2.0 |
| 2.6.0  | cu126 |  3.12  |  3.2.0 |
| 2.6.0  | cu126 |  3.11  |  3.2.0 |
| 2.6.0  | cu124 |  3.13  |  3.2.0 |
| 2.6.0  | cu124 |  3.12  |  3.2.0 |
| 2.6.0  | cu124 |  3.11  |  3.2.0 |
| 2.5.1  | cu124 |  3.12  |  3.1.0 |
| 2.5.1  | cu124 |  3.11  |  3.1.0 |
| 2.5.0  | cu124 |  3.12  |  3.1.0 |
| 2.5.0  | cu124 |  3.11  |  3.1.0 |
| 2.4.1  | cu124 |  3.12  |  3.0.0 |
| 2.4.1  | cu124 |  3.11  |  3.0.0 |
| 2.4.0  | cu124 |  3.12  |  3.0.0 |
| 2.4.0  | cu124 |  3.11  |  3.0.0 |
+--------+-------+--------+--------+
  • Notice how torch==2.5.1 is not compatible with triton==3.0.0 and only torch 2.6.0+ is compatible with Python 3.13, for example. Important to know...
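If you'd rather check this programmatically than read the METADATA file by hand, here is a minimal sketch using the standard library (run it inside the environment where torch is installed; note that torch's triton pin typically carries a Linux-only environment marker, but it still tells you the intended triton version):

# Print the triton requirement recorded in the installed torch wheel's METADATA
from importlib.metadata import requires

for req in requires("torch") or []:
    if "triton" in req.lower():
        print(req)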

Overall, almost everything I've discussed for torch is summarized in the compatibility matrix that PyTorch maintains. However, when pip-installing CUDA libraries it's essential to understand that libraries (like torch) are only tested against specific CUDA release versions, and you have to pip install the correct individual library versions associated with a particular CUDA release. In other words, "your mileage may vary" if you pip install library versions that haven't been fully tested by torch or other libraries. MOST times it will work, but sometimes it will not...

xformers & flash attention 2 & CUDA compatibility

'xformers' is strictly tied to a specific torch version. However, it is a little more flexible regarding flash attention 2 and CUDA.

Consult these three scripts for the most up-to-date compatibility information:
  • torch
  • flash attention 2
  • CUDA

By examining these three source files for recent xformers releases we get the following table (as of 3/10/2025). Unfortunately I didn't have time to finish this table, so make sure to check for yourself:

+------------------+-------+---------------+--------------------------------+
| Xformers Version | Torch |      FA2      |       CUDA (excl. 11.x)        |
+------------------+-------+---------------+--------------------------------+
| v0.0.29.post3    | 2.6.0 | 2.7.1 - 2.7.2 | 12.1.0, 12.4.1, 12.6.3, 12.8.0 | *pypi
| v0.0.29.post2    | 2.6.0 | 2.7.1 - 2.7.2 | 12.1.0, 12.4.1, 12.6.3, 12.8.0 | *pypi
| v0.0.29.post1    | 2.5.1 | 2.7.1 - 2.7.2 | 12.1.0, 12.4.1                 | *only from pytorch
| v0.0.29 (BUG)    | 2.5.1 |               |                                | *only from pytorch
| v0.0.28.post3    | 2.5.1 |               |                                | *only from pytorch
| v0.0.28.post2    | 2.5.0 |               |                                | *only from pytorch
| v0.0.28.post1    | 2.4.1 |               |                                | *only from pytorch
| v0.0.27.post2    | 2.4.0 |               |                                | *pypi
| v0.0.27.post1    | 2.4.0 |               |                                | *pypi
| v0.0.27          | 2.3.0 |               |                                | *pypi
| v0.0.26.post1    | 2.3.0 |               |                                | *pypi
| v0.0.25.post1    | 2.2.2 |               |                                | *pypi
+------------------+-------+---------------+--------------------------------+

NOTE: All Linux wheels have always been available via pip. However, the Windows wheels from 0.0.28.post1 through 0.0.29.post1, for whatever reason, were only compiled by PyTorch themselves. They can be obtained here.
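To see what your installed xformers build reports (torch version, CUDA version, and which attention backends are available), you can run its built-in info command:

python -m xformers.info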

flash attention 2 compatibility

This repository is currently the best place to get Flash Attention 2 wheels for Windows. Please note that a Windows release is NOT made for every release that the parent repository issues (which are only Linux wheels).

Whereas xformers is specifically tied to a torch release, flash attention 2 is specifically tied to a CUDA release, although it's fairly flexible regarding torch compatibility.

Examine this script for the most recent compatibility information. Doing so for recent releases gives the following table:

+--------------+-----------------------------------+-------------------+
| FA2          |              Torch                | CUDA (excl. 11.x) |
+--------------+-----------------------------------+-------------------+
| v2.7.4.post1 | 2.2.2, 2.3.1, 2.4.0, 2.5.1, 2.6.0 | 12.4.1            |
| v2.7.1.post1 | 2.3.1, 2.4.0, 2.5.1               | 12.4.1            |
| v2.7.0.post2 | 2.3.1, 2.4.0, 2.5.1               | 12.4.1            |
| v2.6.3       | 2.2.2, 2.3.1, 2.4.0               | 12.3.2            |
| v2.6.1       | 2.2.2, 2.3.1                      | 12.3.2            |
| v2.5.9.post2 | 2.2.2, 2.3.1                      | 12.2.2            |
| v2.5.9.post1 | 2.2.2, 2.3.0                      | 12.2.2            |
| v2.5.8       | 2.2.2, 2.3.0                      | 12.2.2            |
| v2.5.6       | 2.1.2, 2.2.2                      | 12.2.2            |
| v2.5.2       | 2.1.2, 2.2.0                      | 12.2.2            |
| v2.4.2       | 2.1.2, 2.2.0                      | 12.2.2            |
+--------------+-----------------------------------+-------------------+

Notice how, for example, the maximum supported CUDA version for any release is only 12.4.1.
NOTE: flash attention 2 only supports certain model architectures
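For a quick smoke test of a pip-installed wheel, something like the following works on a supported GPU (a minimal sketch; it assumes a CUDA-enabled torch build and fp16 inputs, which flash attention 2 requires):

# Minimal flash-attn smoke test (assumes a CUDA-enabled torch build and a supported GPU)
import torch
from flash_attn import flash_attn_func

# (batch, seqlen, num_heads, head_dim), fp16 on the GPU
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)

out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # expected: torch.Size([1, 128, 8, 64])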

pydantic & langchain & Python 3.12

Fortunately, the horrible clown car of langchain migrating everything to pydantic version 2 is almost over...most people should finally have upgraded...but just FYI, this was an issue with older versions of langchain and Python 3.12...

Python 3.12.4 is incompatible with pydantic.v1 as of pydantic==2.7.3
langchain-ai/langchain#22692
Everything should now be fine as long as Langchain 0.3+ is used, which requires pydantic version 2+

Conclusion

If your program uses a combination of flash attention 2, xformers, torch, CUDA, and triton...the bottleneck is flash attention 2, the most recent release of which only supports CUDA 12.4.1. Therefore, since most users simply install the latest CUDA "release" system-wide and expect things to just "work," please consider pip-installing the CUDA libraries instead to ensure compatibility...this will save a lot of time and grief for novice users and programmers like myself...Hope this helps!
