[v.1.1][Upgrade] PyTorch 2.1 and CUDA 12.1 upgrade #3982
Left a minor suggestion about tensorrt and py3.11 support.
Job PR-3982-c72fe00 is done.
@@ -75,6 +75,7 @@ function install_all_no_tests {
}

function build_pkg {
    pip install --upgrade setuptools wheel
Upgraded setuptools and wheel here because test_install fails with the old version of setuptools.
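A quick way to confirm whether an environment is affected is to compare the installed setuptools version against a floor. The sketch below is only illustrative: the thread does not say which setuptools version is needed, so the `minimum` argument is a placeholder, and the helper names are hypothetical.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str) -> tuple:
    """Leading numeric components only, e.g. '68.2.2' -> (68, 2, 2)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'post1' or 'dev0'
        parts.append(int(piece))
    return tuple(parts)


def is_older_than(package: str, minimum: str) -> bool:
    """True when the package is missing or older than `minimum`.

    `minimum` is an assumption supplied by the caller; this PR does not
    state which setuptools release is required.
    """
    current = installed_version(package)
    return current is None or version_tuple(current) < version_tuple(minimum)
```

Calling `is_older_than("setuptools", "<some floor>")` before building would flag environments where `pip install --upgrade setuptools wheel` is still needed.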
@@ -1,4 +1,4 @@
-FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
+FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker
Why don't we use the same CUDA version for training and inference?
Currently, the available base DLC inference images only support CUDA 11.8.
Ref: https://github.com/aws/deep-learning-containers/blob/master/available_images.md#ec2-framework-containers-tested-on-ec2-ecs-and-eks-only
Let's make sure the base image's CUDA and torch versions are consistent. Otherwise, we might run into gaps and issues during hosting.
Interesting, then that would mean tensorrt would fail with a segmentation fault on the inference image.
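One way to catch this kind of training/inference drift early is to compare the version fields encoded in the image tags. The following is a minimal sketch, assuming the tag layout seen in the tag quoted in this thread (`<framework>-<device>-py<NNN>-cu<NNN>-...`); the helper names are hypothetical and the training tag is an illustrative value, not one taken from this PR.

```python
import re

# Assumed tag layout, inferred from the inference tag quoted in this thread:
#   <framework version>-<cpu|gpu>-py<NNN>[-cu<NNN>]-<os>-<platform>
TAG_PATTERN = re.compile(
    r"^(?P<framework>\d+\.\d+\.\d+)-(?P<device>cpu|gpu)-py(?P<python>\d+)"
    r"(?:-cu(?P<cuda>\d+))?-"
)


def parse_dlc_tag(image: str) -> dict:
    """Split an image reference ('repo/name:tag') into its version fields."""
    tag = image.rsplit(":", 1)[-1]
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag layout: {tag!r}")
    return match.groupdict()


def cuda_mismatch(training_image: str, inference_image: str) -> bool:
    """True when the two tags declare different CUDA builds."""
    return parse_dlc_tag(training_image)["cuda"] != parse_dlc_tag(inference_image)["cuda"]


# Inference image from the Dockerfile change in this PR.
inference = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker"
)
# Hypothetical training tag, used only to illustrate a cu121 vs cu118 mismatch.
training = "pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker"

if cuda_mismatch(training, inference):
    print("CUDA builds differ between training and inference images")
```

A check like this only compares what the tags claim; it does not verify what is actually installed inside the image.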
@@ -123,7 +123,7 @@ def test_f1_metrics_for_multiclass(eval_metric):
    )
    val_score = predictor._learner._best_score
    eval_score = predictor.evaluate(dataset.test_df)[eval_metric]
    assert abs(val_score - eval_score) < 1e-4
Make sure @zhiqiangdon is fine with this change.
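For reference, the assertion in this hunk is an absolute-tolerance comparison between the validation score and the evaluation score. A small sketch of the same idea using the standard library is shown below; `scores_agree` is a hypothetical helper, not part of the test suite, and it differs slightly from the assertion because `math.isclose` uses `<=` rather than `<`.

```python
import math


def scores_agree(val_score: float, eval_score: float, abs_tol: float = 1e-4) -> bool:
    """True when the two scores differ by at most `abs_tol`.

    rel_tol=0.0 disables the relative check, so only the absolute
    difference matters, as in `abs(val_score - eval_score) < 1e-4`.
    """
    return math.isclose(val_score, eval_score, rel_tol=0.0, abs_tol=abs_tol)


# Scores 2e-5 apart pass with the default tolerance; scores 1e-2 apart do not.
print(scores_agree(0.91234, 0.91236))
print(scores_agree(0.90, 0.91))
```

Keeping the tolerance as a named parameter makes a change to it visible in review, which is the concern raised in this comment.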
LGTM. Thanks for taking care of the upgrade! One TODO item may be making the CUDA versions the same between training and inference.
Job PR-3982-ab02f9e is done.
Job PR-3982-08f1763 is done.
Just a reminder that the master branch has new commits that we may need to merge into this PR for testing.
…tch-4
* 'master' of https://github.com/awslabs/autogluon: (46 commits)
  [core] move transformers to setup_utils, bump dependency version (autogluon#3984)
  [AutoMM] Fix one lightning upgrade issue (autogluon#3991)
  [CI][Feature] Create a package version table (autogluon#3972)
  [v.1.1][Upgrade] PyTorch 2.1 and CUDA 12.1 upgrade (autogluon#3982)
  [WIP] Code implementation of Conv-LoRA (autogluon#3933)
  [timeseries] Ensure that all metrics handle missing values in the target (autogluon#3966)
  [timeseries] Fix path and device bugs (autogluon#3979)
  [AutoMM]Remove grounding-dino (autogluon#3974)
  [Docs] Update install modules content (autogluon#3976)
  Add note on pd.to_datetime (autogluon#3975)
  [AutoMM] Improve DINO performance (autogluon#3970)
  Minor correction in differ to pick correct environment (autogluon#3968)
  Fix windows python 3.11 issue by removing ray (autogluon#3956)
  [CI][Feature] Package Version Comparator (autogluon#3962)
  [timeseries] Add support for categorical covariates (autogluon#3874)
  [timeseries] Add method for plotting forecasts (autogluon#3889)
  Update conf.py copyright to reflect current year (autogluon#3932)
  [Timeseries][CI]Refactor CI to skip AutoMM and Tabular tests w.r.t timeseries changes (autogluon#3942)
  Fix HPO crash in memory check (autogluon#3931)
  [AutoMM][CI] Capping scikit-learn to avoid HPO test failure (autogluon#3947)
  ...
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-154.us-west-2.compute.internal>
Issue #, if available: #3613
Description of changes:
Follow-up to #3663: this PR updates the CI images to use PyTorch 2.1 and CUDA 12.1.
This PR also updates tensorrt and resolves the issue mentioned in the link: #3190 (comment)
Note: This upgrade might cause an issue in Dockerfile.gpu-inference (used only in AG Cloud); a fix for that might come out in the future.
This PR also fixes some tests that fail after upgrading to Torch 2.1. Additionally, some packages are upgraded as well.
This PR will be merged once the latest CPU and GPU images are pushed to ECR. Please do not merge before then.
This PR also addresses the issue: #3708
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.