
guide for pip-installing cuda/cudnn to be compatible with triton-windows #43

@BBC-Esq

(updated 3/11/2025)

Most of the CUDA-related libraries can be installed using pip instead of relying on a system-wide installation. Doing this has two primary benefits:

  1. Allows you to control the CUDA release version used, so you don't have to worry about users installing an incompatible version system-wide; and
  2. Allows users to keep whatever system-wide CUDA installation other programs might require - basically, they don't have to install/re-install different CUDA versions for each program they want to use.

1) CUDA

Technically, "CUDA" or "CUDA toolkit" refers to multiple "sub-libraries" (if you will) and each can be pip-installed.

For example:

pip install nvidia-cuda-runtime-cu12==12.4.127
pip install nvidia-cublas-cu12==12.4.5.8
pip install nvidia-cuda-nvrtc-cu12==12.4.127
pip install nvidia-cuda-nvcc-cu12==12.4.131
pip install nvidia-cufft-cu12==11.2.1.3

NOTE: Notice how each library has its own version. All of the versions above come from CUDA "release" 12.4.1. It's essential to understand that when Nvidia issues a "release" it uses a version such as "12.4.1", but to pip install that release you must pin the specific library versions (not the Nvidia CUDA "release" version) for compatibility reasons. For example, if you want to pip install all the libraries for CUDA "release" 12.6.3, you must find the specific library version for each library your program depends on. You can do this by reviewing the .json files here for the particular CUDA "release" you're interested in.

The versions of the individual libraries never match the CUDA "release" version, and frequently do not match between themselves. This is because each library only gets a new version when Nvidia actually updates it, and a given library may not change between CUDA "releases." Thus, it's possible you'll see the same version number for a library across different CUDA "releases." The key takeaway is that when pip-installing CUDA, for compatibility reasons, you generally want individual library versions that match the versions used in a particular CUDA "release."
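If it helps, here is a minimal sketch (standard library only) for listing which Nvidia library versions are actually installed in your environment, so you can compare them against the .json files for the CUDA "release" you're targeting. Adjust the package names to whatever your program actually installs:

def list_installed_cuda_libs():
    # Print the versions of the pip-installed Nvidia libraries in the current environment.
    from importlib.metadata import version, PackageNotFoundError

    packages = [
        "nvidia-cuda-runtime-cu12",
        "nvidia-cublas-cu12",
        "nvidia-cuda-nvrtc-cu12",
        "nvidia-cuda-nvcc-cu12",
        "nvidia-cufft-cu12",
        "nvidia-cudnn-cu12",
    ]

    for name in packages:
        try:
            print(f"{name}: {version(name)}")
        except PackageNotFoundError:
            print(f"{name}: not installed")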

2) cuDNN

cuDNN provides neural-network primitives like convolution, pooling, normalization, and others. It's not technically part of "CUDA," but most machine learning stuff requires it. It can also be pip-installed; for example:

pip install nvidia-cudnn-cu12==9.1.0.70

Nvidia promises that cuDNN 9 is forwards and backwards compatible with all CUDA 12.x releases. HOWEVER, you still need to double-check whether your program (or its other dependencies) is compatible with the specific version of cuDNN that you pip install. For example, for a significant amount of time the ctranslate2 library was NOT compatible with cuDNN 9+ despite being compatible with CUDA 12+. Fortunately, this has been resolved and is no longer an issue, but it is something to be aware of...just because Nvidia promises compatibility doesn't mean that all other libraries compatible with a particular version of CUDA will automatically be compatible with a particular version of cuDNN.

A good example of this is the well-known torch library (discussed further below). It's only "tested" with a particular version of cuDNN and considers any other version "experimental". See the "compatibility matrix" further below.

If you want to test different versions of cuDNN for whatever reason, you can go here or check out here to get the installables for your platform and desired version.

3) Triton

In simplistic terms, triton generates custom, more efficient code that Nvidia GPUs can run. As of 3.2.0 post11, triton for Windows can now be installed from PyPI:

pip install triton-windows

In previous versions, you would install it by using the link to the specific wheel on the repository's home page; for example:

pip install https://github.com/woct0rdho/triton-windows/releases/download/v3.1.0-windows.post8/triton-3.1.0-cp311-cp311-win_amd64.whl

4) How to use the pip-installed CUDA/cuDNN libraries?

Triton for Windows (as of post12) now bundles the necessary CUDA libraries and other files internally. You can see this in windows_utils.py, which sets the search priority - e.g. first CUDA_PATH, then the files bundled within Triton, and so on.

However, if for whatever reason you do not want this default behavior, you can still rely on the specific versions of the CUDA libraries that you pip-installed by temporarily setting CUDA_PATH, using a function like the set_cuda_paths function below. This function must be run within the entry-point script for your program:

Change this depending on which CUDA-related libraries your program ACTUALLY relies on:

def set_cuda_paths():
    import sys
    import os
    from pathlib import Path

    # Locate the "nvidia" folder inside the active venv's site-packages
    venv_base = Path(sys.executable).parent.parent
    nvidia_base_path = venv_base / 'Lib' / 'site-packages' / 'nvidia'
    cuda_path_runtime = nvidia_base_path / 'cuda_runtime' / 'bin'
    cuda_path_runtime_lib = nvidia_base_path / 'cuda_runtime' / 'lib' / 'x64'
    cuda_path_runtime_include = nvidia_base_path / 'cuda_runtime' / 'include'
    cublas_path = nvidia_base_path / 'cublas' / 'bin'
    cudnn_path = nvidia_base_path / 'cudnn' / 'bin'
    nvrtc_path = nvidia_base_path / 'cuda_nvrtc' / 'bin'
    nvcc_path = nvidia_base_path / 'cuda_nvcc' / 'bin'

    paths_to_add = [
        str(cuda_path_runtime),
        str(cuda_path_runtime_lib),
        str(cuda_path_runtime_include),
        str(cublas_path),
        str(cudnn_path),
        str(nvrtc_path),
        str(nvcc_path),
    ]

    # Prepend the pip-installed locations to PATH (keeping whatever was already there)
    current_value = os.environ.get('PATH', '')
    new_value = os.pathsep.join(paths_to_add + [current_value] if current_value else paths_to_add)
    os.environ['PATH'] = new_value

    # Point CUDA_PATH (used by triton) at the pip-installed cuda_runtime package
    triton_cuda_path = nvidia_base_path / 'cuda_runtime'
    os.environ['CUDA_PATH'] = str(triton_cuda_path)

This TEMPORARILY prepends the pip-installed CUDA (and cuDNN) locations to PATH and points CUDA_PATH at them. Prepending PATH is necessary for libraries other than triton that your program might use (e.g. transformers), and setting CUDA_PATH is necessary for triton because it primarily relies upon CUDA_PATH.

NOTE: This works when using venv and pip-installing the libraries and NOT necessarily when using conda, etc., which manage virtual environments differently.

The set_cuda_paths function essentially forces your program to FIRST look for the necessary libraries where you pip-installed them, but WILL NOT otherwise remove the other paths that other programs may rely on. Moreover, this is temporary...so when you close your program the paths remain the same as they were.

It is crucial to remember that if your program creates a new process/subprocess you must either pass these environment variables or, better yet, call the "set_cuda_paths" function again within the new process/subprocess.
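As a rough sketch of that pattern (assuming set_cuda_paths lives in a module of your own - here hypothetically named cuda_paths.py - and is called before any CUDA-dependent imports):

# Hypothetical entry-point script; "cuda_paths" is whichever module of yours holds set_cuda_paths.
from cuda_paths import set_cuda_paths

set_cuda_paths()  # run BEFORE importing torch, transformers, etc.

import multiprocessing


def worker(item):
    # As recommended above, re-apply the paths inside each new process before
    # any CUDA-dependent imports happen there.
    set_cuda_paths()
    import torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        print(pool.map(worker, range(2)))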

IMPORTANT: If you plan on installing any version of triton other than post12, check here for more specific details and adjust your code accordingly.

5) ptxas.exe

Triton requires ptxas.exe. As of triton post12, this file is bundled and the path is properly set. The bundled version originates from a specific version of CUDA and is fine for most use cases. However, you can still control which version of ptxas.exe triton uses, as discussed in step 7 below.

6) lib folder

triton also requires a particular lib folder, which, for unexplained reasons, is NOT included when pip-installing the CUDA libraries. As of triton post12, however, this folder is also bundled and the path is correctly set. For advanced use cases you can still use a lib folder that originates from a different CUDA release, as discussed in step 7 below.

7) using a specific CUDA version of ptxas.exe and lib folder

As of triton post12, both ptxas.exe and the necessary lib folder are included and the paths to them are properly set. Thus, even if you use something like the set_cuda_paths function to control where your program looks for CUDA-related files, triton will fall back to its bundled versions of these files. However, for advanced development purposes you can still override them using the following instructions:

  1. Regarding ptxas.exe, when you pip install nvidia-cuda-nvcc-cu12, ptxas.exe will be in the \Lib\site-packages\nvidia\cuda_nvcc\bin\ directory. However, triton looks for it in the \Lib\site-packages\nvidia\cuda_runtime\bin\ directory. Therefore, you simply need to copy it manually or have a Python function do it (see the sketch after this list).

IMPORTANT: I say "copy" because it is crucial that you do not "move" the file. Remember, libraries other than triton will still look for it in the standard directory.

  2. Regarding the lib folder, for unexplained reasons, it is NOT included when pip-installing the CUDA libraries. To get the lib folder for a particular CUDA release you can use this program I created...select the CUDA release you want...then select "CUDA Runtime (cudart)". The relevant lib folder is inside the .zip file that is downloaded. Then copy this folder manually (or with a Python function) to the \Lib\site-packages\nvidia\cuda_runtime\ directory within your virtual environment.
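As promised, here is a rough sketch of both copy steps (it assumes the same venv layout used by the set_cuda_paths function above; the downloaded_lib_folder argument and example path are purely illustrative):

def copy_ptxas_and_lib(downloaded_lib_folder):
    # Rough sketch of the two copy steps described above; assumes a standard venv layout.
    import shutil
    import sys
    from pathlib import Path

    venv_base = Path(sys.executable).parent.parent
    nvidia_base = venv_base / 'Lib' / 'site-packages' / 'nvidia'

    # Step 1: COPY (not move) ptxas.exe from cuda_nvcc\bin into cuda_runtime\bin
    src_ptxas = nvidia_base / 'cuda_nvcc' / 'bin' / 'ptxas.exe'
    dst_bin = nvidia_base / 'cuda_runtime' / 'bin'
    dst_bin.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_ptxas, dst_bin / 'ptxas.exe')

    # Step 2: copy the downloaded lib folder into cuda_runtime\lib
    dst_lib = nvidia_base / 'cuda_runtime' / 'lib'
    shutil.copytree(downloaded_lib_folder, dst_lib, dirs_exist_ok=True)

# Example (illustrative path only):
# copy_ptxas_and_lib(r"C:\Downloads\cuda_runtime\lib")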

Overall, this is truly an advanced use case, is no longer required from triton post12 onwards, and should be used with caution. The bundled lib folder and ptxas.exe are tried and tested with the specific version of triton you install. Overriding them is really only necessary when, for example, another part of your program requires a specific CUDA version that's different from what triton bundles and you want to ensure that all libraries use the same versions, perhaps for testing purposes.

8) What if I update triton in between versions of my program?

Hypothetically, let's say that Version 1 of your program uses triton post12 but Version 2 uses a different version of triton. In that case, it will be necessary to delete certain folders containing "compiled code" that the former triton version created. The folders to delete are located at:

C:\Users\<your username>\.triton\cache\
C:\Users\<your username>\AppData\Local\Temp\torchinductor_<your username>\

Explanation: The first time your program uses triton (e.g. via transformers) it compiles certain code and caches it for subsequent runs of your program. This code is NOT deleted when you delete a virtual environment and will persist even if you pip install a different version of triton. Therefore, anytime you update the version of triton, the above two folders must be deleted. You can instruct users to do this manually, but the better approach is to include a function in your installation script that does it automatically; for example:

def clean_triton_cache():
    """Remove Triton cache to ensure clean compilation with current CUDA paths."""
    import shutil
    from pathlib import Path

    triton_cache_dir = Path.home() / '.triton' / 'cache'

    if triton_cache_dir.exists():
        try:
            print(f"\nRemoving Triton cache at {triton_cache_dir}...")
            shutil.rmtree(triton_cache_dir)
            print("\033[92mTriton cache successfully removed.\033[0m")
            return True
        except Exception as e:
            print(f"\033[91mWarning: Failed to remove Triton cache: {e}\033[0m")
            return False
    else:
        print("\nNo Triton cache found to clean.")
        return True

If this is not done, your program will attempt to use the previously-compiled code even though a new version of triton is installed. This only needs to happen in an installation script when a new version of your program relies on a different version of triton.
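The example above only clears the .triton cache; here is a companion sketch for the second (torchinductor) folder. It assumes the default Windows temp location shown above, which is resolved at runtime because the folder name includes your username:

def clean_torchinductor_cache():
    """Remove the torchinductor cache folder mentioned above (default Windows temp location assumed)."""
    import getpass
    import shutil
    import tempfile
    from pathlib import Path

    cache_dir = Path(tempfile.gettempdir()) / f"torchinductor_{getpass.getuser()}"

    if cache_dir.exists():
        try:
            shutil.rmtree(cache_dir)
            print(f"Removed {cache_dir}")
            return True
        except Exception as e:
            print(f"Warning: failed to remove {cache_dir}: {e}")
            return False
    print("No torchinductor cache found to clean.")
    return True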

9) What about other libraries?

Since Triton is basically never used alone, here is some useful compatibility information to help save time and grief.

torch & CUDA compatibility

The most recent torch wheels are described as either cu124 or cu126. These monikers refer to the CUDA "release" that torch has been tested with.

+------------+------------------------------------+
| Wheel Name | Torch Versions Supported           |
+------------+------------------------------------+
| cu126      | 2.6.0                              |
| cu124      | 2.6.0, 2.5.1, 2.5.0, 2.4.1, 2.4.0  |
+------------+------------------------------------+

Unfortunately, however, the monikers cu124 and cu126 don't tell you the specific release: is it 12.4.0 or 12.4.1, for example? To answer this, you have to examine PyTorch's build matrix, which shows the specific CUDA library versions that torch is tested with:

Current as of 3/9/2025
+--------------+------------+------------+------------+
|              |   cu124    |   cu126    |   cu128    | * cu128 not officially released yet
+--------------+------------+------------+------------+
| cuda-nvrtc   | 12.4.127   | 12.6.77    | 12.8.61    |
| cuda-runtime | 12.4.127   | 12.6.77    | 12.8.57    |
| cuda-cupti   | 12.4.127   | 12.6.80    | 12.8.57    |
| cudnn        | 9.1.0.70   | 9.5.1.17   | 9.7.1.26   |
| cublas       | 12.4.5.8   | 12.6.4.1   | 12.8.3.14  |
| cufft        | 11.2.1.3   | 11.3.0.4   | 11.3.3.41  |
| curand       | 10.3.5.147 | 10.3.7.77  | 10.3.9.55  |
| cusolver     | 11.6.1.9   | 11.7.1.2   | 11.7.2.55  |
| cusparse     | 12.3.1.170 | 12.5.4.2   | 12.5.7.53  |
| cusparselt   | 0.6.2      | 0.6.3      | 0.6.3      |
| nccl         | 2.25.1     | 2.25.1     | 2.25.1     |
| nvtx         | 12.4.127   | 12.6.77    | 12.8.55    |
| nvjitlink    | 12.4.127   | 12.6.85    | 12.8.61    |
| cufile       | -          | 1.11.1.6   | 1.13.0.11  |
+--------------+------------+------------+------------+

Additionally, you then have to examine all of the .json files mentioned in step number one (1) above. After doing this, it can be determined that cu124 refers to 12.4.1, cu126 refers to 12.6.3 (with one notable exception), and the soon-to-be-released cu128 refers to 12.8.0. The sole exception regarding cu126 is its usage of cuda-runtime version 12.6.77, which stems from CUDA 12.6.2. All other libraries within cu126 come from CUDA 12.6.3.

Overall, torch is only tested with these specific versions of the CUDA libraries. I have personally encountered errors when, for example, installing version 12.4.0 instead of 12.4.1, and this holds true for torch as well as other libraries like flash attention 2, xformers, etc. (discussed further below). Basically, you need to fully understand which CUDA "release" the specific library versions you pip install originate from, which then allows you to determine whether you're using a compatible version.
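As a quick sanity check, a CUDA-enabled torch build will report the CUDA release and cuDNN build it was compiled against (keep in mind torch.version.cuda only reports major.minor, e.g. "12.4", not the patch level):

# Quick sanity check of what your installed torch wheel was built against
import torch

print(torch.__version__)               # e.g. 2.6.0+cu124
print(torch.version.cuda)              # CUDA release torch was built with, e.g. 12.4
print(torch.backends.cudnn.version())  # cuDNN build number, e.g. 90100 for 9.1.0
print(torch.cuda.is_available())       # whether the runtime actually finds your GPU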

torch and triton compatibility

All torch wheels come with a METADATA (capitalization intended) file that shows which version of triton they are compatible with (see the snippet after the table below). By examining all permutations of the recent torch wheels, you get the following table:

+--------+-------+--------+--------+
| Torch  | CUDA  | Python | Triton |
+--------+-------+--------+--------+
| 2.6.0  | cu126 |  3.13  |  3.2.0 |
| 2.6.0  | cu126 |  3.12  |  3.2.0 |
| 2.6.0  | cu126 |  3.11  |  3.2.0 |
| 2.6.0  | cu124 |  3.13  |  3.2.0 |
| 2.6.0  | cu124 |  3.12  |  3.2.0 |
| 2.6.0  | cu124 |  3.11  |  3.2.0 |
| 2.5.1  | cu124 |  3.12  |  3.1.0 |
| 2.5.1  | cu124 |  3.11  |  3.1.0 |
| 2.5.0  | cu124 |  3.12  |  3.1.0 |
| 2.5.0  | cu124 |  3.11  |  3.1.0 |
| 2.4.1  | cu124 |  3.12  |  3.0.0 |
| 2.4.1  | cu124 |  3.11  |  3.0.0 |
| 2.4.0  | cu124 |  3.12  |  3.0.0 |
| 2.4.0  | cu124 |  3.11  |  3.0.0 |
+--------+-------+--------+--------+
  • Notice how torch==2.5.1 is not compatible with triton==3.0.0 and only torch 2.6.0+ is compatible with Python 3.13, for example. Important to know...
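If you'd rather check this programmatically than read the METADATA file by hand, here is a minimal sketch using the standard library (run it inside the environment where torch is installed; note that torch's triton pin typically carries a Linux-only environment marker, but it still tells you the intended triton version):

# Print the triton requirement recorded in the installed torch wheel's METADATA
from importlib.metadata import requires

for req in requires("torch") or []:
    if "triton" in req.lower():
        print(req)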

Overall, almost everything I've discussed for torch is summarized in the compatibility matrix that PyTorch maintains. However, when pip-installing CUDA libraries it's essential to understand that libraries (like torch) are only tested against specific CUDA release versions, and you have to pip install the correct individual library versions associated with a particular CUDA release. In other words, "your mileage may vary" if you pip install library versions that haven't been fully tested by torch or other libraries. MOST times it will work, but sometimes it will not...

xformers & flash attention 2 & CUDA compatibility

'xformers' is strictly tied to a specific torch version. However, it is a little more flexible regarding flash attention 2 and CUDA.

Consult these three scripts for the most up-to-date compatibility information:
  • torch
  • flash attention 2
  • CUDA

By examining these three source files for recent xformers releases we get the following table (as of 3/10/2025). Unfortunately I didn't have time to finish this table, so make sure to check for yourself:

+------------------+-------+---------------+--------------------------------+
| Xformers Version | Torch |      FA2      |       CUDA (excl. 11.x)        |
+------------------+-------+---------------+--------------------------------+
| v0.0.29.post3    | 2.6.0 | 2.7.1 - 2.7.2 | 12.1.0, 12.4.1, 12.6.3, 12.8.0 | *pypi
| v0.0.29.post2    | 2.6.0 | 2.7.1 - 2.7.2 | 12.1.0, 12.4.1, 12.6.3, 12.8.0 | *pypi
| v0.0.29.post1    | 2.5.1 | 2.7.1 - 2.7.2 | 12.1.0, 12.4.1                 | *only from pytorch
| v0.0.29 (BUG)    | 2.5.1 |               |                                | *only from pytorch
| v0.0.28.post3    | 2.5.1 |               |                                | *only from pytorch
| v0.0.28.post2    | 2.5.0 |               |                                | *only from pytorch
| v0.0.28.post1    | 2.4.1 |               |                                | *only from pytorch
| v0.0.27.post2    | 2.4.0 |               |                                | *pypi
| v0.0.27.post1    | 2.4.0 |               |                                | *pypi
| v0.0.27          | 2.3.0 |               |                                | *pypi
| v0.0.26.post1    | 2.3.0 |               |                                | *pypi
| v0.0.25.post1    | 2.2.2 |               |                                | *pypi
+------------------+-------+---------------+--------------------------------+

NOTE: All Linux wheels have always been available via pip. However, the Windows wheels from 0.0.28.post1 through 0.0.29.post1, for whatever reason, were only compiled by PyTorch themselves. They can be obtained here.
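To see what your installed xformers build reports (torch version, CUDA version, and which attention backends are available), you can run its built-in info command:

python -m xformers.info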

flash attention 2 compatibility

This repository is currently the best place to get Flash Attention 2 wheels for Windows. Please note that a Windows release is NOT made for every release that the parent repository issues (which are only Linux wheels).

Whereas xformers is specifically tied to a torch release, flash attention 2 is specifically tied to a CUDA release, although it's fairly flexible regarding torch compatibility.

Examine this script for the most recent compatibility information. Doing so for recent releases gives the following table:

+--------------+-----------------------------------+-------------------+
| FA2          |              Torch                | CUDA (excl. 11.x) |
+--------------+-----------------------------------+-------------------+
| v2.7.4.post1 | 2.2.2, 2.3.1, 2.4.0, 2.5.1, 2.6.0 | 12.4.1            |
| v2.7.1.post1 | 2.3.1, 2.4.0, 2.5.1               | 12.4.1            |
| v2.7.0.post2 | 2.3.1, 2.4.0, 2.5.1               | 12.4.1            |
| v2.6.3       | 2.2.2, 2.3.1, 2.4.0               | 12.3.2            |
| v2.6.1       | 2.2.2, 2.3.1                      | 12.3.2            |
| v2.5.9.post2 | 2.2.2, 2.3.1                      | 12.2.2            |
| v2.5.9.post1 | 2.2.2, 2.3.0                      | 12.2.2            |
| v2.5.8       | 2.2.2, 2.3.0                      | 12.2.2            |
| v2.5.6       | 2.1.2, 2.2.2                      | 12.2.2            |
| v2.5.2       | 2.1.2, 2.2.0                      | 12.2.2            |
| v2.4.2       | 2.1.2, 2.2.0                      | 12.2.2            |
+--------------+-----------------------------------+-------------------+

Notice how, for example, the maximum supported CUDA version for any release is only 12.4.1.
NOTE: flash attention 2 only supports certain model architectures
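For a quick smoke test of a pip-installed wheel, something like the following works on a supported GPU (a minimal sketch; it assumes a CUDA-enabled torch build and fp16 inputs, which flash attention 2 requires):

# Minimal flash-attn smoke test (assumes a CUDA-enabled torch build and a supported GPU)
import torch
from flash_attn import flash_attn_func

# (batch, seqlen, num_heads, head_dim), fp16 on the GPU
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)

out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # expected: torch.Size([1, 128, 8, 64])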

pydantic & langchain & Python 3.12

Fortunately, the horrible clown car of langchain migrating everything to pydantic version 2 is almost over...most people should finally have upgraded...but just FYI, this was an issue with older versions of langchain and Python 3.12...

Python 3.12.4 is incompatible with pydantic.v1 as of pydantic==2.7.3
langchain-ai/langchain#22692
Everything should now be fine as long as Langchain 0.3+ is used, which requires pydantic version 2+

Conclusion

If your program uses a combination of flash attention 2, xformers, torch, CUDA, and triton...the bottleneck is flash attention 2, the most recent release of which only supports CUDA 12.4.1. Therefore, since most users simply install the latest CUDA "release" system-wide and expect things to just "work," please consider pip-installing the CUDA libraries instead to ensure compatibility...this will save a lot of time and grief for novice users and programmers like myself...Hope this helps!
