[v.1.1][Upgrade] PyTorch 2.1 and CUDA 12.1 upgrade #3982

prateekdesai04 · 2024-03-15T06:20:12Z

Issue #, if available: #3613

Description of changes:
Follow-up to #3663. This PR updates the CI images to use PyTorch 2.1 and CUDA 12.1.
It also updates tensorrt and resolves the issue mentioned in #3190 (comment).
Note: this upgrade might cause an issue in Dockerfile.gpu-inference (used only in AG Cloud); a fix for that might come out in the future.

This PR also fixes some tests that fail after upgrading to Torch 2.1, and upgrades some additional packages.
This PR will be merged once the latest CPU and GPU images are pushed to ECR. Please do not merge before then.
This PR also addresses issue #3708.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

AnirudhDagar

Left a minor suggestion about tensorrt and py3.11 support.

multimodal/setup.py

github-actions · 2024-03-15T20:16:57Z

Job PR-3982-c72fe00 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3982/c72fe00/index.html

prateekdesai04 · 2024-03-15T20:56:49Z

.github/workflow_scripts/env_setup.sh

@@ -75,6 +75,7 @@ function install_all_no_tests {
 }

 function build_pkg {
+    pip install --upgrade setuptools wheel


Upgraded setuptools and wheel here because test_install fails with an old version of setuptools.
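As an illustration of the check behind this change, here is a minimal sketch that reports which versions of setuptools and wheel are installed before a build. It uses only the standard library; the helper name `installed_version` is hypothetical and is not part of env_setup.sh, and the minimum setuptools version that test_install needs is not stated in this thread.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Print the versions of the two build tools upgraded in build_pkg.
    for name in ("setuptools", "wheel"):
        print(name, installed_version(name))
```

Running this before and after `pip install --upgrade setuptools wheel` would show whether the upgrade actually changed the build environment.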

multimodal/tests/unittests/others/test_metrics.py

zhiqiangdon · 2024-03-15T21:28:15Z

CI/docker/Dockerfile.gpu-inference

@@ -1,4 +1,4 @@
-FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
+FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker


Why don't we use the same CUDA version for training and inference?

Currently, the available base DLC inference images only support CUDA 11.8.
Ref:
https://github.com/aws/deep-learning-containers/blob/master/available_images.md#ec2-framework-containers-tested-on-ec2-ecs-and-eks-only

Let's make sure the base image's CUDA and torch versions are consistent. Otherwise, we might have gaps and issues during hosting.

Interesting, then that would mean tensorrt would fail with a segmentation fault on the inference image.
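The consistency concern in this thread can be checked mechanically. The following is a rough sketch, assuming image tags follow the pattern seen in the diff above (`<torch>-<device>-py<ver>-cu<ver>-...`); the function names `parse_dlc_tag` and `cuda_mismatch` are hypothetical, and encoding CUDA 12.1 as "121" is an assumption that mirrors the `cu118` convention in the tag.

```python
import re

# Pattern for tags such as "2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker".
# The CUDA part is optional so that CPU tags without "-cuNNN" still parse.
_TAG_RE = re.compile(
    r"^(?P<torch>\d+\.\d+\.\d+)-(?P<device>cpu|gpu)-py(?P<py>\d+)"
    r"(?:-cu(?P<cuda>\d+))?-"
)


def parse_dlc_tag(image: str) -> dict:
    """Extract torch, device, Python, and CUDA versions from an image reference."""
    tag = image.rsplit(":", 1)[-1]  # keep only the part after the last colon
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized image tag: {tag!r}")
    return match.groupdict()


def cuda_mismatch(image: str, expected_cuda: str) -> bool:
    """Return True when the image's CUDA build differs from `expected_cuda`."""
    return parse_dlc_tag(image)["cuda"] != expected_cuda


if __name__ == "__main__":
    inference_image = (
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
        "pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker"
    )
    print(parse_dlc_tag(inference_image))
    # Compare against the CUDA 12.1 target used for the CI images in this PR.
    print("mismatch with cu121:", cuda_mismatch(inference_image, "121"))
```

A check like this could flag the training/inference gap discussed above, although it only inspects the tag string and says nothing about what is actually installed inside the image.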

tonyhoo · 2024-03-15T20:56:19Z

multimodal/tests/unittests/others/test_metrics.py

@@ -123,7 +123,7 @@ def test_f1_metrics_for_multiclass(eval_metric):
    )
    val_score = predictor._learner._best_score
    eval_score = predictor.evaluate(dataset.test_df)[eval_metric]
-    assert abs(val_score - eval_score) < 1e-4


Make sure @zhiqiangdon is fine with this change.
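For context, the removed assertion compares the validation score with the evaluation score under an absolute tolerance of 1e-4. The replacement line is not shown in this excerpt, so the sketch below only illustrates the general pattern of a configurable tolerance; the helper name `scores_agree` and the looser tolerance in the usage example are hypothetical.

```python
import math


def scores_agree(val_score: float, eval_score: float, abs_tol: float = 1e-4) -> bool:
    """Return True when two scores differ by at most `abs_tol`."""
    # rel_tol=0.0 makes this a purely absolute comparison, like abs(a - b) < tol,
    # except that math.isclose also accepts a difference exactly equal to abs_tol.
    return math.isclose(val_score, eval_score, rel_tol=0.0, abs_tol=abs_tol)


if __name__ == "__main__":
    print(scores_agree(0.91234, 0.91235))          # within the default 1e-4
    print(scores_agree(0.90, 0.91))                # outside the default 1e-4
    print(scores_agree(0.90, 0.91, abs_tol=0.05))  # passes with a looser, hypothetical delta
```

Making the tolerance an explicit parameter keeps the intent of the test visible when a dependency upgrade shifts the metric slightly.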

tonyhoo · 2024-03-15T22:07:51Z

CI/docker/Dockerfile.gpu-inference

@@ -1,4 +1,4 @@
-FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
+FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker


Let's make sure the base image's CUDA and torch versions are consistent. Otherwise, we might have gaps and issues during hosting.

zhiqiangdon

LGTM. Thanks for taking care of the upgrade! One TODO item may be making the CUDA versions the same between training and inference.

github-actions · 2024-03-16T02:01:14Z

Job PR-3982-ab02f9e is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3982/ab02f9e/index.html

github-actions · 2024-03-18T21:33:48Z

Job PR-3982-08f1763 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3982/08f1763/index.html

zhiqiangdon

Just a reminder that the master branch has new commits that we may need to merge into this PR for testing.

…tch-4

* 'master' of https://github.com/awslabs/autogluon: (46 commits)
  [core] move transformers to setup_utils, bump dependency version (autogluon#3984)
  [AutoMM] Fix one lightning upgrade issue (autogluon#3991)
  [CI][Feature] Create a package version table (autogluon#3972)
  [v.1.1][Upgrade] PyTorch 2.1 and CUDA 12.1 upgrade (autogluon#3982)
  [WIP] Code implementation of Conv-LoRA (autogluon#3933)
  [timeseries] Ensure that all metrics handle missing values in the target (autogluon#3966)
  [timeseries] Fix path and device bugs (autogluon#3979)
  [AutoMM]Remove grounding-dino (autogluon#3974)
  [Docs] Update install modules content (autogluon#3976)
  Add note on pd.to_datetime (autogluon#3975)
  [AutoMM] Improve DINO performance (autogluon#3970)
  Minor correction in differ to pick correct environment (autogluon#3968)
  Fix windows python 3.11 issue by removing ray (autogluon#3956)
  [CI][Feature] Package Version Comparator (autogluon#3962)
  [timeseries] Add support for categorical covariates (autogluon#3874)
  [timeseries] Add method for plotting forecasts (autogluon#3889)
  Update conf.py copyright to reflect current year (autogluon#3932)
  [Timeseries][CI]Refactor CI to skip AutoMM and Tabular tests w.r.t timeseries changes (autogluon#3942)
  Fix HPO crash in memory check (autogluon#3931)
  [AutoMM][CI] Capping scikit-learn to avoid HPO test failure (autogluon#3947)
  ...

Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-154.us-west-2.compute.internal>

Ubuntu and others added 7 commits February 29, 2024 00:29

test

63cd486

Merge branch 'autogluon:master' into master

0ea0036

Merge branch 'autogluon:master' into master

c89c698

Merge branch 'autogluon:master' into master

b058e48

inital commit

5f884ed

adding more changes and fixing tests

0b63881

fix comment

c72fe00

prateekdesai04 requested review from tonyhoo, Innixma and zhiqiangdon March 15, 2024 06:20

prateekdesai04 added the labels "model list checked" (You have updated the model list after modifying multimodal unit tests/docs) and "run-multi-gpu" (Run multimodal multi-gpu tests) Mar 15, 2024

AnirudhDagar reviewed Mar 15, 2024

multimodal/setup.py

prateekdesai04 mentioned this pull request Mar 15, 2024

[core] move transformers to setup_utils, bump dependency version #3984

Merged

prateekdesai04 commented Mar 15, 2024

multimodal/tests/unittests/others/test_metrics.py (outdated)

zhiqiangdon reviewed Mar 15, 2024

reverting torchmetrics to test

7a131ae

tonyhoo reviewed Mar 15, 2024

adjusting delta

ab02f9e

zhiqiangdon approved these changes Mar 16, 2024

drewbitt mentioned this pull request Mar 17, 2024

Support Python 3.11 #2687

Closed

7 tasks

Ubuntu added 4 commits March 18, 2024 16:39

removing tensorrt upgrade

18a0e24

lint

8b80935

fix

af72497

reverting

08f1763

zhiqiangdon reviewed Mar 18, 2024

prateekdesai04 merged commit 32265af into autogluon:master Mar 18, 2024

This was referenced Mar 19, 2024

Add support for python 3.11 #3190

Merged

[DRAFT] Increase the upper bound of torch and lightning to accept 2.1 version #3663

Closed

ddelange mentioned this pull request Apr 2, 2024

Add support for PyTorch 2.1 #3613

Closed

2 tasks

AnirudhDagar mentioned this pull request Apr 2, 2024

Support torchvision v0.16.1 and cuda 12.0 #3827

Closed

LennartPurucker pushed a commit to LennartPurucker/autogluon that referenced this pull request Jun 1, 2024

[v.1.1][Upgrade] PyTorch 2.1 and CUDA 12.1 upgrade (autogluon#3982)

41da382

Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-154.us-west-2.compute.internal>
