[DRAFT] Increase the upper bound of torch and lightning to accept 2.1 version #3663
Conversation
```diff
-    "torch": ">=2.0,<2.1",  # "<{N+1}" upper cap, sync with common/src/autogluon/common/utils/try_import.py
+    "torch": ">=2.0,<2.2",  # "<{N+1}" upper cap, sync with common/src/autogluon/common/utils/try_import.py
     "lightning": ">=2.0.0,<2.1",  # "<{N+1}" upper cap
     "pytorch_lightning": ">=2.0.0,<2.1",  # "<{N+1}" upper cap, capping `lightning` does not cap `pytorch_lightning`!
```
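As a side note, the effect of raising the cap can be sanity-checked with the `packaging` library (not part of this PR, just an illustration of the specifier semantics):

```python
from packaging.specifiers import SpecifierSet

# Old and new torch caps, as in the diff above
old_cap = SpecifierSet(">=2.0,<2.1")
new_cap = SpecifierSet(">=2.0,<2.2")

# torch 2.1.x is rejected by the old cap but accepted by the new one
print("2.1.0" in old_cap)  # False
print("2.1.0" in new_cap)  # True
print("2.2.0" in new_cap)  # False (still capped below 2.2)
```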
As far as I understand, the problem boils down to the following: running `pip install torch==2.0.0` installs PyTorch compiled with CUDA 11.7 (see here). In contrast, running `pip install torch==2.1.0`, or just `pip install torch` (as of Nov 7, 2023), installs PyTorch compiled with CUDA 12. Since the environment that we use to run tests comes with CUDA 11.8 installed, it cannot run PyTorch builds compiled against CUDA 12.

I don't think it's related to any specific code that we have in AutoGluon. To fix this problem, we would need to ensure that during tests we install the correct PyTorch build with something like

```
pip install torch~=2.1.0 --index-url https://download.pytorch.org/whl/cu118
```

rather than the plain `pip install torch~=2.1.0` that we currently use.
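For reference, `~=2.1.0` is PEP 440's compatible-release operator, equivalent to `>=2.1.0, <2.2`, so it pins the minor version but allows patch upgrades. A quick illustration with the `packaging` library (again, only an aside, not code from the PR):

```python
from packaging.specifiers import SpecifierSet

# "~=2.1.0" behaves like ">=2.1.0, <2.2"
compat = SpecifierSet("~=2.1.0")

print("2.1.2" in compat)  # True  (patch upgrade allowed)
print("2.2.0" in compat)  # False (minor upgrade excluded)
print("2.0.1" in compat)  # False (below the lower bound)
```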
Some other things that I noticed:

- During installation of `multimodal`, `torch-2.0.1` is actually installed, even though we increase the cap in this PR. Potentially, this happens because one of the `multimodal` dependencies caps `torch<2.1`. Therefore, the multimodal tests pass.
- I tried creating a fresh Python 3.10 environment on a `p3` instance (with a V100 GPU) and installing PyTorch 2.1 in it. Trying to use CUDA results in the same error as shown in the timeseries logs. To reproduce:

  ```
  conda create -n cuda12 python=3.10
  conda activate cuda12
  pip install torch~=2.1.0
  python -c "import torch; torch.zeros(1).cuda()"
  ```

  Output:

  ```
  RuntimeError: The NVIDIA driver on your system is too old (found version 11080).
  Please update your GPU driver by downloading and installing a new version from the URL:
  http://www.nvidia.com/Download/index.aspx
  Alternatively, go to: https://pytorch.org/ to install a PyTorch version that has been
  compiled with your version of the CUDA driver
  ```
From a quick look at the available wheels, all should be fine as long as the end user requests the correct CUDA version in the (extra) index URL. This will work fine:

```
# cu118 has wheels for 2.0.0 through 2.1.2
pip install autogluon.multimodal --force-reinstall --extra-index-url https://download.pytorch.org/whl/cu118

# cu121 has wheels for 2.1.0 through 2.1.2
pip install autogluon.multimodal --force-reinstall --extra-index-url https://download.pytorch.org/whl/cu121
```

Apart from maybe a docs update, are there any other blockers here? Asking because this PR blocks support for the latest tensorrt (which completes AutoGluon Python 3.11 support).
@tonyhoo Any updates on the status of this PR?
Superseded by #3982
Issue #, if available:
Description of changes:
Increase the upper bound of torch and lightning to accept 2.1 version
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.