Contributor

@prateekdesai04 prateekdesai04 commented Mar 15, 2024

Issue #, if available: #3613

Description of changes:
Follow-up to #3663. This PR updates the CI images to use PyTorch 2.1 and CUDA 12.1.
It also updates tensorrt and resolves the issue described in #3190 (comment).
Note: this upgrade might cause an issue in Dockerfile.gpu-inference (used only in AG Cloud); a fix for that may follow in a later PR.

This PR also fixes some tests that fail after upgrading to Torch 2.1, and upgrades some packages as well.
This PR will be merged once the latest CPU and GPU images are pushed to ECR. Please do not merge before then.
This PR also addresses issue #3708.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@prateekdesai04 added the "model list checked" and "run-multi-gpu" labels Mar 15, 2024
Contributor

@AnirudhDagar AnirudhDagar left a comment

Left a minor suggestion about tensorrt and py3.11 support.

Job PR-3982-c72fe00 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3982/c72fe00/index.html

@@ -75,6 +75,7 @@ function install_all_no_tests {
}

function build_pkg {
pip install --upgrade setuptools wheel
Contributor Author

@prateekdesai04 prateekdesai04 Mar 15, 2024

Upgraded this package here because test_install fails with an old version of setuptools.
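When debugging an install failure like this one, it can help to see which build-tool versions the environment actually has. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the PR does not say which setuptools version is the minimum that works, so no threshold is checked here.

```python
from importlib import metadata


def installed_versions(packages=("setuptools", "wheel")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this before and after `pip install --upgrade setuptools wheel` shows whether the upgrade took effect in the environment that runs the build.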

@@ -1,4 +1,4 @@
-FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
+FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker
Contributor

Why don't we use the same cuda version for training and inference?

Contributor Author

Currently, the available base DLC inference images only support CUDA 11.8.
ref:
https://github.com/aws/deep-learning-containers/blob/master/available_images.md#ec2-framework-containers-tested-on-ec2-ecs-and-eks-only

Collaborator

Let's make sure the base image's CUDA and torch versions are consistent. Otherwise, we might run into gaps and issues during hosting.

Contributor

Interesting, then that would mean tensorrt would fail with the segmentation fault on the inference image.
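
The consistency concern raised in this thread can be checked mechanically from the image tags. The following is a minimal sketch, assuming the tags follow the `<version>-<device>-<python>[-<cuda>]-<os>` layout seen in this PR's Dockerfile; `ImageTag`, `parse_image_tag`, and `same_torch_and_cuda` are hypothetical names introduced for illustration and are not part of AutoGluon or the DLC tooling.

```python
import re
from typing import NamedTuple, Optional


class ImageTag(NamedTuple):
    framework_version: str  # e.g. "2.1.0"
    device: str             # "gpu" or "cpu"
    python: str             # e.g. "py310"
    cuda: Optional[str]     # e.g. "cu118"; None when the tag has no CUDA part


# Assumed tag layout, inferred from the tags in this PR:
# <version>-<device>-<python>[-<cuda>]-<os>[-<suffix>]
_TAG_RE = re.compile(
    r"^(?P<version>\d+\.\d+\.\d+)-(?P<device>gpu|cpu)-(?P<python>py\d+)"
    r"(?:-(?P<cuda>cu\d+))?-"
)


def parse_image_tag(image: str) -> ImageTag:
    """Extract the version fields from an image reference of the form repo:tag."""
    tag = image.rsplit(":", 1)[-1]
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized image tag: {tag!r}")
    return ImageTag(match["version"], match["device"], match["python"], match["cuda"])


def same_torch_and_cuda(image_a: str, image_b: str) -> bool:
    """Return True when both images share the framework version and CUDA tag."""
    a, b = parse_image_tag(image_a), parse_image_tag(image_b)
    return (a.framework_version, a.cuda) == (b.framework_version, b.cuda)
```

Passing the inference image from the Dockerfile above together with a training image tag to `same_torch_and_cuda` would flag the CUDA 11.8 versus 12.1 gap discussed here. The check only compares tag strings; it does not inspect what is installed inside the images.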

@@ -123,7 +123,7 @@ def test_f1_metrics_for_multiclass(eval_metric):
)
val_score = predictor._learner._best_score
eval_score = predictor.evaluate(dataset.test_df)[eval_metric]
assert abs(val_score - eval_score) < 1e-4
Collaborator

Make sure @zhiqiangdon is fine with this change.
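
The assertion in the hunk above compares the validation score and the test score within an absolute tolerance. As a minimal sketch of that kind of check, assuming a hypothetical helper named `scores_match` (the PR itself uses a plain `abs(...) < 1e-4` assertion), the same comparison can be expressed with `math.isclose`:

```python
import math


def scores_match(val_score: float, eval_score: float, abs_tol: float = 1e-4) -> bool:
    """Return True when the two scores differ by at most abs_tol.

    rel_tol is set to 0.0 so only the absolute difference matters. Note that
    math.isclose uses <=, while the assertion in the test uses a strict <.
    """
    return math.isclose(val_score, eval_score, rel_tol=0.0, abs_tol=abs_tol)
```

Keeping the tolerance as a named parameter makes any later change to it easy to spot in review.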

Contributor

@zhiqiangdon zhiqiangdon left a comment

LGTM. Thanks for taking care of the upgrade! One TODO item may be to make the CUDA versions the same between training and inference.

Job PR-3982-ab02f9e is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3982/ab02f9e/index.html

@drewbitt drewbitt mentioned this pull request Mar 17, 2024
7 tasks

Job PR-3982-08f1763 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3982/08f1763/index.html

Contributor

@zhiqiangdon zhiqiangdon left a comment

Just a reminder that the master branch has new commits that we may need to merge into this PR for testing.

@prateekdesai04 prateekdesai04 merged commit 32265af into autogluon:master Mar 18, 2024
ddelange added a commit to ddelange/autogluon that referenced this pull request Mar 21, 2024
…tch-4

* 'master' of https://github.com/awslabs/autogluon: (46 commits)
  [core] move transformers to setup_utils, bump dependency version (autogluon#3984)
  [AutoMM] Fix one lightning upgrade issue (autogluon#3991)
  [CI][Feature] Create a package version table (autogluon#3972)
  [v.1.1][Upgrade] PyTorch 2.1 and CUDA 12.1 upgrade (autogluon#3982)
  [WIP] Code implementation of Conv-LoRA (autogluon#3933)
  [timeseries] Ensure that all metrics handle missing values in the target (autogluon#3966)
  [timeseries] Fix path and device bugs (autogluon#3979)
  [AutoMM]Remove grounding-dino (autogluon#3974)
  [Docs] Update install modules content (autogluon#3976)
  Add note on pd.to_datetime (autogluon#3975)
  [AutoMM] Improve DINO performance (autogluon#3970)
  Minor correction in differ to pick correct environment (autogluon#3968)
  Fix windows python 3.11 issue by removing ray (autogluon#3956)
  [CI][Feature] Package Version Comparator (autogluon#3962)
  [timeseries] Add support for categorical covariates (autogluon#3874)
  [timeseries] Add method for plotting forecasts (autogluon#3889)
  Update conf.py copyright to reflect current year (autogluon#3932)
  [Timeseries][CI]Refactor CI to skip AutoMM and Tabular tests w.r.t timeseries changes (autogluon#3942)
  Fix HPO crash in memory check (autogluon#3931)
  [AutoMM][CI] Capping scikit-learn to avoid HPO test failure (autogluon#3947)
  ...
@ddelange ddelange mentioned this pull request Apr 2, 2024
2 tasks
LennartPurucker pushed a commit to LennartPurucker/autogluon that referenced this pull request Jun 1, 2024
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-154.us-west-2.compute.internal>