2.5.0 stable + CUDA 12.4 broken; unable to find CUDA libs (works in 2.4.1) #138324

@dcsouthwick

🐛 Describe the bug

Installing torch 2.5.0 stable from pip with CUDA 12.4 reproducibly produces a broken install when following the 'Getting Started' guide:

docker run -it --rm --gpus=all almalinux/9-base
[root@a8af28733c07 /]# python3 -V
Python 3.9.18
[root@a8af28733c07 /]# python3 -m pip install torch torchvision torchaudio
[root@a8af28733c07 /]# python3
>>> import torch
Traceback (most recent call last):
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 300, in _load_global_deps
    ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.9/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.12: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 367, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 325, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 284, in _preload_cuda_deps
    raise ValueError(f"{lib_name} not found in the system path {sys.path}")
ValueError: libcufile.so.*[0-9] not found in the system path ['', '/usr/lib64/python39.zip', '/usr/lib64/python3.9', '/usr/lib64/python3.9/lib-dynload', '/usr/local/lib64/python3.9/site-packages', '/usr/local/lib/python3.9/site-packages', '/usr/lib64/python3.9/site-packages', '/usr/lib/python3.9/site-packages']
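The ValueError above says the library was looked for under the entries of sys.path. A minimal sketch for checking this by hand, without importing torch, follows. It assumes the CUDA pip wheels unpack their shared objects under `<site-packages>/nvidia/<component>/lib/`; that layout and the helper name `find_cuda_libs` are assumptions for illustration, not taken from the traceback.

```python
import glob
import os
import sys
from typing import Iterable, List, Optional


def find_cuda_libs(pattern: str, search_paths: Optional[Iterable[str]] = None) -> List[str]:
    """Return files matching <path>/nvidia/*/lib/<pattern> for each search path.

    Hypothetical helper: the nvidia/<component>/lib layout is an assumption
    about how the CUDA pip wheels are unpacked.
    """
    if search_paths is None:
        search_paths = sys.path
    matches: List[str] = []
    for path in search_paths:
        nvidia_dir = os.path.join(path, "nvidia")
        if not os.path.isdir(nvidia_dir):
            continue  # this sys.path entry has no nvidia/ folder
        matches.extend(glob.glob(os.path.join(nvidia_dir, "*", "lib", pattern)))
    return sorted(matches)


if __name__ == "__main__":
    # Both names are copied from the error messages in the traceback.
    for pattern in ("libcudart.so.12", "libcufile.so.*[0-9]"):
        found = find_cuda_libs(pattern)
        print(pattern, "->", found if found else "not found")
```

If both patterns come back "not found", the CUDA runtime wheels are probably missing from the environment rather than merely undiscoverable.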

This works fine for previous versions, e.g. 2.4.1 and 2.4.0:
python3 -m pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124

I notice that installing previous versions this way yields torch version 2.4.1+cu124, whereas the current stable install instructions yield 2.5.0 without the +cu124 suffix. Is this simply a documentation issue?
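The +cu124 suffix can be read from the package metadata even when `import torch` fails. This is a rough sketch only: `local_version_tag` and `installed_torch_tag` are hypothetical helper names, and the code only splits the version string at `+`.

```python
from importlib import metadata
from typing import Optional


def local_version_tag(version: str) -> Optional[str]:
    """Return the part after '+', e.g. 'cu124' for '2.4.1+cu124', else None."""
    _, sep, local = version.partition("+")
    return local if sep and local else None


def installed_torch_tag() -> Optional[str]:
    """Read torch's version from package metadata without importing torch."""
    try:
        return local_version_tag(metadata.version("torch"))
    except metadata.PackageNotFoundError:
        return None  # torch is not installed in this environment


if __name__ == "__main__":
    print("torch local version tag:", installed_torch_tag())
```

A result of None for an installed torch matches the untagged case described above, which points to a wheel from PyPI rather than from the cu124 index.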

Versions

torch-2.5.0-cp39-cp39-manylinux1_x86_64.whl from pypi

The diagnostic script cannot run because it imports the broken torch distribution:

[root@a8af28733c07 /]# wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
[root@a8af28733c07 /]# python3 collect_env.py
Traceback (most recent call last):
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 300, in _load_global_deps
    ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.9/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.12: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//collect_env.py", line 17, in <module>
    import torch
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 367, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 325, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 284, in _preload_cuda_deps
    raise ValueError(f"{lib_name} not found in the system path {sys.path}")
ValueError: libcufile.so.*[0-9] not found in the system path ['/', '/usr/lib64/python39.zip', '/usr/lib64/python3.9', '/usr/lib64/python3.9/lib-dynload', '/usr/local/lib64/python3.9/site-packages', '/usr/local/lib/python3.9/site-packages', '/usr/lib64/python3.9/site-packages', '/usr/lib/python3.9/site-packages']
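Because collect_env.py cannot get past `import torch`, some of the same information can be gathered without torch. The sketch below lists the installed torch and nvidia-* distributions via `importlib.metadata`. The name prefixes it filters on are an assumption about how the relevant packages are named, and `filter_cuda_related` is a hypothetical helper.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable, Iterator, Tuple


def filter_cuda_related(dists: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep packages whose normalized name starts with 'torch' or 'nvidia-'."""
    keep: Dict[str, str] = {}
    for name, version in dists:
        normalized = name.lower().replace("_", "-")
        if normalized.startswith(("torch", "nvidia-")):
            keep[normalized] = version
    return dict(sorted(keep.items()))


def installed_pairs() -> Iterator[Tuple[str, str]]:
    """Yield (name, version) for every installed distribution."""
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            yield name, dist.version


if __name__ == "__main__":
    print("python:", platform.python_version())
    print("platform:", platform.platform())
    for name, version in filter_cuda_related(installed_pairs()).items():
        print(f"{name}=={version}")
```

If the output lists torch but no nvidia-* packages, the CUDA runtime libraries were probably never installed alongside it.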

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @seemethere @malfet @osalpekar @atalman @ptrblck

Labels

high priority
module: binaries (Anything related to official binaries that we release to users)
module: cuda (Related to torch.cuda, and CUDA support in general)
module: regression (It used to work, and now it doesn't)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
