(updated 3/11/2025)
Most of the CUDA-related libraries can be installed using pip instead of relying on a system-wide installation. Doing this has two primary benefits:
- Allows you to control the CUDA release version used so you don't have to worry about users installing an incompatible version system wide; and
- Allows users to keep whatever system-wide CUDA installation other programs might require - basically, they don't have to install/re-install different CUDA versions for each program they want to use.
1) CUDA
Technically, "CUDA" or "CUDA toolkit" refers to multiple "sub-libraries" (if you will) and each can be pip-installed.
For example:
pip install nvidia-cuda-runtime-cu12==12.4.127
pip install nvidia-cublas-cu12==12.4.5.8
pip install nvidia-cuda-nvrtc-cu12==12.4.127
pip install nvidia-cuda-nvcc-cu12==12.4.131
pip install nvidia-cufft-cu12==11.2.1.3
NOTE: Notice how each library has its own version. All of the library versions above originate from CUDA "release" 12.4.1. It's essential to understand that when Nvidia issues a "release" it uses a version such as "12.4.1", but if you want to pip install that release you must get the specific "library versions" (not the Nvidia CUDA "release version") for compatibility reasons. For example, if you want to pip install all the libraries for CUDA "release" 12.6.3, you must find the specific library versions for each library your program depends on. You can do this by reviewing the .json files here for the particular CUDA "release" you're interested in.
The versions of the individual libraries never match the CUDA "release" version, and frequently do not match each other. Each library is versioned independently: when Nvidia updates a library it gets a new version number, but a library that wasn't touched between two "releases" keeps its old one, so you may see the same version number for a library across different CUDA "releases". The key takeaway is that when pip-installing CUDA, for compatibility reasons, you generally want individual library versions that match the versions used in a particular CUDA "release."
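If you'd rather script this lookup than eyeball the .json files, here is a minimal sketch. It assumes Nvidia's redistrib JSON layout (one "version" key per component entry); verify the URL and schema against the .json files linked above:
import json
import urllib.request

RELEASE = "12.4.1"  # the CUDA *release* you care about, not a library version
URL = f"https://developer.download.nvidia.com/compute/cuda/redist/redistrib_{RELEASE}.json"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

for key, value in data.items():
    # component entries are dicts with a "version" field; skip top-level metadata strings
    if isinstance(value, dict) and "version" in value:
        print(f"{key}: {value['version']}")  # e.g. cuda_cudart: 12.4.127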
2) cuDNN
cuDNN provides primitives for neural-network tasks like convolution, pooling, normalization, and others. It's not "technically" part of "CUDA", but most machine learning stuff requires it. It can also be pip-installed; for example:
pip install nvidia-cudnn-cu12==9.1.0.70
Nvidia promises that cuDNN 9 is forwards and backwards compatible with all CUDA 12.x releases. HOWEVER, you still need to double-check whether your program (or other dependencies) are compatible with the specific version of cuDNN that you pip install. For example, for a significant amount of time the ctranslate2 library was NOT compatible with cuDNN 9+ despite being compatible with CUDA 12+. Fortunately, this has been resolved and is no longer an issue, but it is something to be aware of...just because Nvidia promises compatibility doesn't mean that all other libraries that are compatible with a particular version of CUDA will automatically be compatible with a particular version of cuDNN.
A good example of this is the well-known torch library (discussed further below). It's only "tested" with a particular version of cuDNN and considers any other version "experimental". See the "compatibility matrix" further below.
If you want to test different versions of cuDNN for whatever reason, you can go here or also check out here to get the installables for your platform and desired version.
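To confirm which cuDNN version your environment actually sees at runtime, here is a quick check via torch's built-in reporting (assuming torch is installed):
import torch

print(torch.version.cuda)              # CUDA release torch was built against, e.g. "12.4"
print(torch.backends.cudnn.version())  # e.g. 90100 for cuDNN 9.1.0
print(torch.backends.cudnn.is_available())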
3) Triton
In simplistic terms, triton generates custom GPU code that runs more efficiently on Nvidia GPUs. As of 3.2.0 post11, triton for Windows can be installed from pypi:
pip install triton-windows
In previous versions, you would install it by using the link to the specific wheel on the repository's home page; for example:
pip install https://github.com/woct0rdho/triton-windows/releases/download/v3.1.0-windows.post8/triton-3.1.0-cp311-cp311-win_amd64.whl
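To verify the install actually compiles and runs a kernel, here is a minimal smoke test. It's essentially the standard vector-add example from Triton's tutorials, and assumes a CUDA build of torch and an Nvidia GPU:
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds threads
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(1024, device="cuda")
y = torch.rand(1024, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(1024, 256),)](x, y, out, 1024, BLOCK_SIZE=256)
assert torch.allclose(out, x + y), "triton kernel produced wrong results"
print("triton OK")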
4) How to use the pip-installed CUDA/cuDNN libraries?
Triton for Windows (as of post12) now internally includes the necessary CUDA libraries and other files. You can see this within windows_utils.py, which prioritizes where to look for them - e.g. first at CUDA_PATH, second internally within triton, and so on.
However, if for whatever reason you do not want this default behavior, you can still rely on the specific versions of the CUDA libraries that you pip-installed by temporarily setting CUDA_PATH, using a function like the set_cuda_paths function below. This function must be run within the entry point script for your program:
# Change this depending on which CUDA-related libraries your program ACTUALLY relies on
def set_cuda_paths():
    import sys
    import os
    from pathlib import Path

    # In a Windows venv, python.exe lives in <venv>\Scripts, so parent.parent is the venv root
    venv_base = Path(sys.executable).parent.parent
    nvidia_base_path = venv_base / 'Lib' / 'site-packages' / 'nvidia'
    cuda_path_runtime = nvidia_base_path / 'cuda_runtime' / 'bin'
    cuda_path_runtime_lib = nvidia_base_path / 'cuda_runtime' / 'lib' / 'x64'
    cuda_path_runtime_include = nvidia_base_path / 'cuda_runtime' / 'include'
    cublas_path = nvidia_base_path / 'cublas' / 'bin'
    cudnn_path = nvidia_base_path / 'cudnn' / 'bin'
    nvrtc_path = nvidia_base_path / 'cuda_nvrtc' / 'bin'
    nvcc_path = nvidia_base_path / 'cuda_nvcc' / 'bin'

    paths_to_add = [
        str(cuda_path_runtime),
        str(cuda_path_runtime_lib),
        str(cuda_path_runtime_include),
        str(cublas_path),
        str(cudnn_path),
        str(nvrtc_path),
        str(nvcc_path),
    ]

    # Prepend (not replace) so paths that other programs rely on remain intact
    current_value = os.environ.get('PATH', '')
    os.environ['PATH'] = os.pathsep.join(paths_to_add + ([current_value] if current_value else []))

    # triton relies primarily on CUDA_PATH
    os.environ['CUDA_PATH'] = str(nvidia_base_path / 'cuda_runtime')
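For example, a minimal entry point might look like this (the imports afterward are just illustrative):
# entry point script (e.g. main.py) - call set_cuda_paths() BEFORE importing
# anything that loads CUDA DLLs (torch, triton, transformers, ctranslate2, etc.)
set_cuda_paths()

import torch  # now resolves its CUDA DLLs from your venv's nvidia folder first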
This TEMPORARILY prepends the pip-installed CUDA (and cuDNN) locations to PATH and points CUDA_PATH at them. Prepending PATH is necessary for libraries other than triton that your program might use (e.g. transformers), and setting CUDA_PATH is necessary for triton because it relies upon CUDA_PATH primarily.
NOTE: This works when using venv and pip-installing the libraries, and NOT necessarily when using conda, etc., which manage virtual environments differently.
The set_cuda_paths function essentially forces your program to FIRST look for the necessary libraries where you pip-installed them, but WILL NOT otherwise remove the other paths that other programs may rely on. Moreover, this is temporary...so when you close your program the paths remain the same as they were.
It is crucial to remember that if your program creates a new process/subprocess you must either pass these environment variables or, better yet, call the "set_cuda_paths" function again within the new process/subprocess.
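For the subprocess case, here is a minimal sketch (the worker.py script name is hypothetical; set_cuda_paths is the function from above):
import os
import subprocess
import sys

set_cuda_paths()  # adjust os.environ in the parent process first

subprocess.run(
    [sys.executable, "worker.py"],  # hypothetical worker script
    env=os.environ.copy(),          # explicitly pass the modified environment along
    check=True,
)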
IMPORTANT: check here for more specific details if you plan on installing any version of triton other than post12, and adjust your code accordingly.
5) ptxas.exe
Triton requires ptxas.exe. As of triton post12, this file is bundled and the path is properly set. The bundled version originates from a specific version of CUDA and is fine for most use cases. However, you can still control which version of ptxas.exe triton uses, as discussed in step 7 below.
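If all you want to swap is ptxas.exe, upstream triton checks a TRITON_PTXAS_PATH environment variable before falling back to its bundled copy; whether your specific triton-windows build honors it should be confirmed in windows_utils.py for the version you install. A hedged sketch:
# Assumes your triton build honors TRITON_PTXAS_PATH; set it BEFORE importing triton
import os
import sys
from pathlib import Path

ptxas = (Path(sys.executable).parent.parent / 'Lib' / 'site-packages' /
         'nvidia' / 'cuda_nvcc' / 'bin' / 'ptxas.exe')
os.environ['TRITON_PTXAS_PATH'] = str(ptxas)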
6) lib folder
triton also requires a particular lib folder which, for unexplained reasons, is NOT included when pip-installing the CUDA libraries. As of triton post12, however, this folder is also bundled and the path is correctly set. For advanced use cases you can still use a lib folder that originates from a different CUDA release, as discussed in step 7 below.
7) using a specific CUDA version of ptxas.exe and lib folder
As of triton post12, both ptxas.exe and the necessary lib folder are included and the paths to them are properly set. Thus, even if you use something like the set_cuda_paths function to control where your program looks for CUDA-related files, triton will now fall back to its "bundled" versions of these files. However, for advanced development purposes, you can still override them using the following instructions:
- Regarding ptxas.exe, when you pip install nvidia-cuda-nvcc-cu12, ptxas.exe will be in the \Lib\site-packages\nvidia\cuda_nvcc\bin\ directory. However, triton looks for it in the \Lib\site-packages\nvidia\cuda_runtime\bin\ directory. Therefore, you simply need to copy it manually or have a Python function do it (see the sketch after this list). IMPORTANT: I say "copy" because it is crucial that you do not "move" the file. Remember, libraries other than triton will still look for it in the standard directory.
- Regarding the lib folder, for unexplained reasons, it is NOT included when pip-installing the CUDA libraries. To get the lib folder for a particular CUDA release you can use this program I created...select the CUDA release you want...then select "CUDA Runtime (cudart)". The relevant lib folder is within the .zip file that is downloaded. Then copy this folder manually (or with a Python function) to the \Lib\site-packages\nvidia\cuda_runtime\ directory within your virtual environment.
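Here is a hypothetical helper for the copy step in the first bullet above, assuming the same venv layout as the set_cuda_paths function from step 4:
import shutil
import sys
from pathlib import Path

def copy_ptxas_for_triton():
    nvidia_base = Path(sys.executable).parent.parent / 'Lib' / 'site-packages' / 'nvidia'
    src = nvidia_base / 'cuda_nvcc' / 'bin' / 'ptxas.exe'
    dst_dir = nvidia_base / 'cuda_runtime' / 'bin'
    dst_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst_dir / 'ptxas.exe')  # copy, do NOT move (see note above)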
Overall, this is truly an advanced use case, is no longer required from triton post12 onwards, and should be used with caution. The bundled lib folder and ptxas.exe are tried and tested with the specific version of triton you install. Overriding them is really only necessary when, for example, another part of your program requires a specific version of CUDA that's different from what triton bundles and you want to ensure that all libraries use the same versions, perhaps for testing purposes.
8) What if I update triton in between versions of my program?
Hypothetically, let's say that Version 1 of your program uses triton post12 but Version 2 uses a different version of triton. It will then be necessary to delete certain folders containing "compiled code" that the former triton version created. The folders to delete are located at:
C:\Users\<your username>\.triton\cache\
C:\Users\<your username>\AppData\Local\Temp\torchinductor_<your username>\
Explanation: The first time your program uses triton (e.g. when using transformers) it compiles certain code for subsequent runs of your program. This code is NOT deleted when you delete a virtual environment and will persist even if you pip install a different version of triton. Therefore, anytime you update the version of triton the above two folders must be deleted. You can instruct users to do this manually, but the better approach is to include a function in your installation script that does it automatically; for example:
def clean_triton_cache():
    """Remove Triton cache to ensure clean compilation with current CUDA paths."""
    import shutil
    from pathlib import Path

    triton_cache_dir = Path.home() / '.triton' / 'cache'
    if triton_cache_dir.exists():
        try:
            print(f"\nRemoving Triton cache at {triton_cache_dir}...")
            shutil.rmtree(triton_cache_dir)
            print("\033[92mTriton cache successfully removed.\033[0m")
            return True
        except Exception as e:
            print(f"\033[91mWarning: Failed to remove Triton cache: {e}\033[0m")
            return False
    else:
        print("\nNo Triton cache found to clean.")
        return True
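The function above only handles the first folder. Here is a companion sketch for the torchinductor folder in the user's temp directory; the folder name pattern follows the path shown above, and tempfile.gettempdir() should resolve to AppData\Local\Temp on Windows:
import getpass
import shutil
import tempfile
from pathlib import Path

def clean_torchinductor_cache():
    """Remove the torchinductor cache left behind by previous triton versions."""
    inductor_dir = Path(tempfile.gettempdir()) / f"torchinductor_{getpass.getuser()}"
    if inductor_dir.exists():
        shutil.rmtree(inductor_dir, ignore_errors=True)
        print(f"Removed {inductor_dir}")
    else:
        print("No torchinductor cache found to clean.")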
If this is not done, your program will attempt to use the previously-compiled code even though a new version of triton is installed. This only needs to be done in an installation script where a new version of your program relies on a different version of triton.
9) What about other libraries?
Since Triton is basically never used alone, here is some useful compatibility information to help save time and grief.
torch & CUDA compatibility
The most recent torch wheels are described as either cu124 or cu126. These monikers refer to the CUDA "release" that torch has been tested with.
+------------+-----------------------------------+
| Wheel Name | Torch Versions Supported          |
+------------+-----------------------------------+
| cu126      | 2.6.0                             |
| cu124      | 2.6.0, 2.5.1, 2.5.0, 2.4.1, 2.4.0 |
+------------+-----------------------------------+
Unfortunately, however, the monikers cu124 and cu126 don't specify the exact release; for example, is it 12.4.0 or 12.4.1? To answer this, you have to examine PyTorch's build matrix, which shows the specific CUDA library versions that torch is tested with:
Current as of 3/9/2025
+--------------+------------+------------+------------+
| | cu124 | cu126 | cu128 | * cu128 not officially released yet
+--------------+------------+------------+------------+
| cuda-nvrtc | 12.4.127 | 12.6.77 | 12.8.61 |
| cuda-runtime | 12.4.127 | 12.6.77 | 12.8.57 |
| cuda-cupti | 12.4.127 | 12.6.80 | 12.8.57 |
| cudnn | 9.1.0.70 | 9.5.1.17 | 9.7.1.26 |
| cublas | 12.4.5.8 | 12.6.4.1 | 12.8.3.14 |
| cufft | 11.2.1.3 | 11.3.0.4 | 11.3.3.41 |
| curand | 10.3.5.147 | 10.3.7.77 | 10.3.9.55 |
| cusolver | 11.6.1.9 | 11.7.1.2 | 11.7.2.55 |
| cusparse | 12.3.1.170 | 12.5.4.2 | 12.5.7.53 |
| cusparselt | 0.6.2 | 0.6.3 | 0.6.3 |
| nccl | 2.25.1 | 2.25.1 | 2.25.1 |
| nvtx | 12.4.127 | 12.6.77 | 12.8.55 |
| nvjitlink | 12.4.127 | 12.6.85 | 12.8.61 |
| cufile | - | 1.11.1.6 | 1.13.0.11 |
+--------------+------------+------------+------------+
Additionally, you then have to examine all of the .json files mentioned in step one (1) above. After doing this, it can be determined that cu124 refers to 12.4.1, cu126 refers to 12.6.3 (with one notable exception), and the soon-to-be-released cu128 refers to 12.8.0. The sole exception regarding cu126 is its usage of cuda-runtime version 12.6.77, which stems from CUDA 12.6.2; all other libraries within cu126 come from CUDA 12.6.3.
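To audit which library versions actually ended up in your venv (so you can compare them against this build matrix and the redistrib .json files), here is a minimal sketch using importlib.metadata:
from importlib.metadata import distributions

for dist in sorted(distributions(), key=lambda d: d.metadata['Name'] or ''):
    name = dist.metadata['Name'] or ''
    if name.startswith('nvidia-'):
        print(f"{name}=={dist.version}")  # e.g. nvidia-cublas-cu12==12.4.5.8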
Overall, torch only tests with these specific versions of the CUDA libraries. I have personally encountered errors when, for example, installing version 12.4.0 instead of 12.4.1...and this holds true for torch as well as other libraries like flash attention 2, xformers, etc. (discussed further below). Basically, you need to fully understand which CUDA "release" the specific library versions you pip install originate from, which then allows you to determine whether you're using a compatible version.
torch and triton compatibility
All torch wheels come with a METADATA (capitalization intended) file that shows which version of triton they are compatible with (see the sketch after the table below for reading it programmatically). By examining all permutations of all recent torch wheels, you get the following table:
+--------+-------+--------+--------+
| Torch | CUDA | Python | Triton |
+--------+-------+--------+--------+
| 2.6.0 | cu126 | 3.13 | 3.2.0 |
| 2.6.0 | cu126 | 3.12 | 3.2.0 |
| 2.6.0 | cu126 | 3.11 | 3.2.0 |
| 2.6.0 | cu124 | 3.13 | 3.2.0 |
| 2.6.0 | cu124 | 3.12 | 3.2.0 |
| 2.6.0 | cu124 | 3.11 | 3.2.0 |
| 2.5.1 | cu124 | 3.12 | 3.1.0 |
| 2.5.1 | cu124 | 3.11 | 3.1.0 |
| 2.5.0 | cu124 | 3.12 | 3.1.0 |
| 2.5.0 | cu124 | 3.11 | 3.1.0 |
| 2.4.1 | cu124 | 3.12 | 3.0.0 |
| 2.4.1 | cu124 | 3.11 | 3.0.0 |
| 2.4.0 | cu124 | 3.12 | 3.0.0 |
| 2.4.0 | cu124 | 3.11 | 3.0.0 |
+--------+-------+--------+--------+
- Notice how torch==2.5.1 is not compatible with triton==3.0.0, and only torch 2.6.0+ is compatible with Python 3.13, for example. Important to know...
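If you'd rather read the triton pin straight from your installed wheel than from this table, here is a minimal sketch; note that the triton requirement may carry a platform marker or be absent entirely on some platforms, so check the output:
from importlib.metadata import requires

for req in requires('torch') or []:
    if 'triton' in req:
        print(req)  # e.g. a pin like triton==3.2.0 with a platform marker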
Overall, almost everything I've discussed for torch is summarized in the compatibility matrix that PyTorch updates. However, when pip installing CUDA libraries it's essential to understand that libraries (like torch) are only tested with specific CUDA release versions, and you have to pip install the correct individual library versions associated with a particular CUDA release. In other words, "your mileage may vary" if you pip install library versions that haven't been fully tested by torch or other libraries. MOST times it will work, but sometimes it will not...
xformers & flash attention 2 & CUDA compatibility
xformers is strictly tied to a specific torch version. However, it is a little more flexible regarding flash attention 2 and CUDA.
Consult these three scripts for the most up-to-date compatibility information:
- torch
- flash attention 2
- CUDA
By examining these three source files for recent xformers releases we get the following table (as of 3/10/2025) (unfortunately, I didn't have time to finish this table, so make sure to check for yourself):
+------------------+-------+---------------+--------------------------------+-------------------+
| Xformers Version | Torch | FA2           | CUDA (excl. 11.x)              | Source            |
+------------------+-------+---------------+--------------------------------+-------------------+
| v0.0.29.post3    | 2.6.0 | 2.7.1 - 2.7.2 | 12.1.0, 12.4.1, 12.6.3, 12.8.0 | pypi              |
| v0.0.29.post2    | 2.6.0 | 2.7.1 - 2.7.2 | 12.1.0, 12.4.1, 12.6.3, 12.8.0 | pypi              |
| v0.0.29.post1    | 2.5.1 | 2.7.1 - 2.7.2 | 12.1.0, 12.4.1                 | only from pytorch |
| v0.0.29 (BUG)    | 2.5.1 |               |                                | only from pytorch |
| v0.0.28.post3    | 2.5.1 |               |                                | only from pytorch |
| v0.0.28.post2    | 2.5.0 |               |                                | only from pytorch |
| v0.0.28.post1    | 2.4.1 |               |                                | only from pytorch |
| v0.0.27.post2    | 2.4.0 |               |                                | pypi              |
| v0.0.27.post1    | 2.4.0 |               |                                | pypi              |
| v0.0.27          | 2.3.0 |               |                                | pypi              |
| v0.0.26.post1    | 2.3.0 |               |                                | pypi              |
| v0.0.25.post1    | 2.2.2 |               |                                | pypi              |
+------------------+-------+---------------+--------------------------------+-------------------+
NOTE: All Linux wheels have always been available using pip. However, Windows wheels from 0.0.28.post1 through 0.0.29.post1, for whatever reason, were only compiled by PyTorch themselves. They can be obtained here.
flash attention 2 compatibility
This repository is currently the best place to get Flash Attention 2 wheels for Windows. Please note that a Windows release is NOT made for every release that the parent repository issues (which are only Linux wheels).
Whereas xformers is specifically tied to a torch release, flash attention 2 is specifically tied to a CUDA release, although it's fairly flexible regarding torch compatibility.
Examine this script for the most recent compatibility information. Doing so for recent releases yields the following table:
+--------------+-----------------------------------+-------------------+
| FA2 | Torch | CUDA (excl. 11.x) |
+--------------+-----------------------------------+-------------------+
| v2.7.4.post1 | 2.2.2, 2.3.1, 2.4.0, 2.5.1, 2.6.0 | 12.4.1 |
| v2.7.1.post1 | 2.3.1, 2.4.0, 2.5.1 | 12.4.1 |
| v2.7.0.post2 | 2.3.1, 2.4.0, 2.5.1 | 12.4.1 |
| v2.6.3 | 2.2.2, 2.3.1, 2.4.0 | 12.3.2 |
| v2.6.1 | 2.2.2, 2.3.1 | 12.3.2 |
| v2.5.9.post2 | 2.2.2, 2.3.1 | 12.2.2 |
| v2.5.9.post1 | 2.2.2, 2.3.0 | 12.2.2 |
| v2.5.8 | 2.2.2, 2.3.0 | 12.2.2 |
| v2.5.6 | 2.1.2, 2.2.2 | 12.2.2 |
| v2.5.2 | 2.1.2, 2.2.0 | 12.2.2 |
| v2.4.2 | 2.1.2, 2.2.0 | 12.2.2 |
+--------------+-----------------------------------+-------------------+
Notice how, for example, the maximum supported CUDA version for any release is only 12.4.1.
NOTE: flash attention 2 only supports certain model architectures.
pydantic & langchain & Python 3.12
Fortunately, the horrible clown car of langchain updating everything to pydantic version 2 is almost over...most people should have upgraded by now...but just FYI, this was an issue with older versions of langchain and Python 3.12:
Python 3.12.4 is incompatible with pydantic.v1 as of pydantic==2.7.3 (langchain-ai/langchain#22692)
Everything should now be fine as long as LangChain 0.3+ is used, which requires pydantic version 2+.