
Conversation

ddelange
Contributor

@ddelange ddelange commented Nov 28, 2023

Issue #, if available: #2687 (comment)

Description of changes: Allow ray 2.7, which includes critical fixes for Python 3.11

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Fix #3807, fix #3920

@Innixma Innixma added this to the 1.1 Release milestone Nov 28, 2023
@Innixma Innixma requested a review from AnirudhDagar November 28, 2023 17:23
@Innixma Innixma added the dependency Related to dependency packages label Nov 28, 2023
Contributor

@AnirudhDagar AnirudhDagar left a comment

Hi @ddelange, thanks for submitting the PR. ray 2.7 introduced some hard deprecations; the failing tests here are due to deprecation warnings triggered by the trainable taking a checkpoint_dir argument. I'm currently working on a fix that will make the trainable function take only one argument, since ray internally loosely checks for the extra checkpoint_dir arg in _detect_checkpoint_function, which then leads to the deprecation warning being raised here. Feel free to share any info you have on this.

@ddelange
Contributor Author

Hi @AnirudhDagar 👋 glad to hear you're already digging in! I hope they start being a bit stricter about their semantic versioning when 3.0 comes out 😅 They're doing quite a bit of work to converge on a more stable API; e.g. ray-project/ray@f22259b also shipped in ray 2.7.

@AnirudhDagar
Contributor

@ddelange After some more digging and reproducing the error in a small, self-contained code snippet, I raised ray-project/ray#41562, which might help us solve the issue. The ray deprecation warning about checkpoint_dir that causes the CI failure is, in my opinion, raised incorrectly; if that's right, this will require an upstream fix before we can update the versions here. After that, we will need to update the signature of hpo_trial and implement the new way of passing checkpoint_dir via ray (which is already mentioned in their deprecation warning).

def hpo_trial(sampled_hyperparameters, learner, checkpoint_dir=None, **_fit_args):

I'll do this after the issue has been verified on the ray side. Will keep you posted, thanks for your patience!
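
For reference, a rough sketch of what that migration could look like (illustrative only, not the actual change; it assumes the ray >= 2.7 ray.train checkpoint API, and the body and metric names are placeholders):

from ray import train


def hpo_trial(sampled_hyperparameters, learner, **_fit_args):  # no checkpoint_dir parameter
    checkpoint = train.get_checkpoint()  # None on the first run of a trial
    if checkpoint is not None:
        with checkpoint.as_directory() as checkpoint_dir:
            pass  # restore trial state from checkpoint_dir here

    # ... run the actual training here ...

    # report metrics (and optionally a new checkpoint) back to ray Tune
    train.report({"validation_performance": 0.0})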

@ddelange
Contributor Author

ddelange commented Dec 6, 2023

fwiw, ray 2.9 will also support pydantic v2 ref ray-project/ray#40451

@ddelange
Contributor Author

@AnirudhDagar how about the autogluon code temporarily ignoring the warning, with a reference to that issue?

@AnirudhDagar
Contributor

AnirudhDagar commented Dec 11, 2023

Thanks for the suggestion @ddelange. I'd only be comfortable ignoring deprecation warnings in tests once someone at ray can validate the issue and we can expect a fix soon. Otherwise we end up bumping the dependency and testing autogluon against the latest version of ray, all tests pass, but at release time users run into a deprecation warning they did not cause. That is not the user experience we want.

Also, I suspect that at some point ray will turn this into a hard error, and then simply ignoring the deprecation warning won't cut it.

@ddelange
Contributor Author

If I understood the reply correctly, this specific DeprecationWarning can safely be ignored in this case (also at runtime), and will even be removed in a future ray version 👍

@ddelange ddelange changed the title Allow ray 2.7 Loosen ray requirement Dec 21, 2023
@AnirudhDagar
Contributor

Hi @ddelange, sorry for the delayed response, I was on vacation. Yes, you are right: given the latest info shared by the ray team, that the warning will be removed in 2.10 anyway, we can support all these versions by temporarily suppressing those warnings. As a TODO, we should add a note in the code that the warning-suppression code needs to be removed later.

@ddelange
Contributor Author

ddelange commented Jan 5, 2024

welcome back! do you feel like hijacking my PR?

@AnirudhDagar
Contributor

No worries, if you're busy with other things, I can help support your PR with the rest of the required changes. Let me know, thanks!

@ddelange
Contributor Author

ddelange commented Jan 5, 2024

Added a warning filter using the message kwarg to regex-match checkpoint_dir, ref https://docs.python.org/3/library/warnings.html#describing-warning-filters
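
A minimal sketch of that kind of filter (illustrative; the exact regex used in the PR may differ):

import warnings

# `message` is a regex matched against the start of the warning text, so this
# suppresses only the checkpoint_dir deprecation warning, not all DeprecationWarnings
warnings.filterwarnings(
    "ignore",
    message=".*checkpoint_dir.*",  # assumed pattern; the PR's actual regex may differ
    category=DeprecationWarning,
)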

@ddelange
Contributor Author

Hi @AnirudhDagar 👋 can you run CI with elevated permissions?

@AnirudhDagar AnirudhDagar added the run-multi-gpu Run multimodal multi-gpu tests label Jan 22, 2024
@ddelange
Contributor Author

ddelange commented Jan 24, 2024

===End Flaky Test Report=== in https://github.com/autogluon/autogluon/actions/runs/7612795659/job/20731334764?pr=3774#step:4:3062

Is it indeed flaky, or can you spot a ray-related error?

edit:

[2024-01-22T15:10:05.633Z] Traceback (most recent call last):
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
[2024-01-22T15:10:05.633Z]     result = ray.get(future)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
[2024-01-22T15:10:05.633Z]     return fn(*args, **kwargs)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
[2024-01-22T15:10:05.633Z]     return func(*args, **kwargs)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
[2024-01-22T15:10:05.633Z]     raise value.as_instanceof_cause()
[2024-01-22T15:10:05.633Z] ray.exceptions.RayTaskError(UnboundLocalError): ray::ImplicitFunc.train() (pid=4194, ip=10.0.1.20, actor_id=92d1d7abbb87715c7cb2334401000000, repr=model_trial)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 342, in train
[2024-01-22T15:10:05.633Z]     raise skipped from exception_cause(skipped)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/air/_internal/util.py", line 88, in run
[2024-01-22T15:10:05.633Z]     self._ret = self._target(*self._args, **self._kwargs)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 115, in <lambda>
[2024-01-22T15:10:05.633Z]     training_func=lambda: self._trainable_func(self.config),
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 332, in _trainable_func
[2024-01-22T15:10:05.633Z]     output = fn()
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/util.py", line 138, in inner
[2024-01-22T15:10:05.633Z]     return trainable(config, **fn_kwargs)
[2024-01-22T15:10:05.633Z]   File "/home/ci/autogluon/timeseries/src/autogluon/timeseries/models/abstract/model_trial.py", line 52, in model_trial
[2024-01-22T15:10:05.633Z]     logger.error(f"\tWarning: Exception caused {model.name} to fail during training... Skipping this model.")
[2024-01-22T15:10:05.633Z] UnboundLocalError: local variable 'model' referenced before assignment
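
(Side note: a minimal illustration of the pattern behind that UnboundLocalError, not autogluon's actual code.)

# Hypothetical illustration: if init_model() raises, `model` is never bound, so
# referencing model.name inside the handler raises UnboundLocalError instead of
# logging the original failure.
def model_trial(config, init_model):
    try:
        model = init_model(config)
        model.fit()
    except Exception:
        print(f"Warning: Exception caused {model.name} to fail during training... Skipping this model.")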

@AnirudhDagar
Contributor

AnirudhDagar commented Jan 24, 2024

No, we still have an issue with ray. You can look at the other workflows and see that the ignore doesn't work. The problem lies in the way the warning is raised as an exception rather than via warnings.warn.

You can try the following: even after adding a warning ignore filter, the deprecation warning will still come up because it's raised as an exception. Sharing an MRE below; pip install ray==2.9.1 before running the snippet in IPython.

import warnings
from ray import tune

# try ignoring all DeprecationWarnings
warnings.simplefilter("ignore", category=DeprecationWarning)

trainable_kwargs = {'kwarg1': 1, 'kwarg2': 2}
def train_fn(config, **trainable_kwargs):
    pass

# Creating the tuner still surfaces the DeprecationWarning despite the filter above
tuner = tune.Tuner(tune.with_parameters(train_fn, **trainable_kwargs))
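
To spell out the distinction (a small illustrative snippet, not ray's code): warnings filters only apply to warnings emitted through warnings.warn; a warning class that is raised propagates as an ordinary exception regardless of any filter.

import warnings

warnings.simplefilter("ignore", category=DeprecationWarning)

warnings.warn("soft deprecation", DeprecationWarning)  # silenced by the filter
raise DeprecationWarning("hard deprecation")           # still raises; the filter is irrelevant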

@AnirudhDagar
Contributor

===End Flaky Test Report=== in https://github.com/autogluon/autogluon/actions/runs/7612795659/job/20731334764?pr=3774#step:4:3062
Is it indeed flaky, or can you spot a ray-related error?

@ddelange I'll investigate what's going on there, and let you know.

@Innixma Innixma modified the milestones: 1.1 Release, 1.0.1 Release Jan 25, 2024
@yinweisu
Contributor

yinweisu commented Apr 1, 2024

Previous CI Run Current CI Run

@AnirudhDagar
Contributor

AnirudhDagar commented Apr 1, 2024

@ddelange Hopefully my latest commits will fix the failing tests. It turns out ray also silently deprecated the local_dir arg in ray-project/ray#33463 (see https://github.com/ray-project/ray/blob/master/python/ray/air/config.py#L656-L657) and replaced it with storage_path. We were still passing the local_dir arg, which no longer had any effect, so ray was falling back to its default save path ~/ray_results.

Note that we are using a private method, _get_trial_checkpoints_with_metric (the most direct replacement for the deprecated function, and I'd like to keep the changes minimal), to get the best trial checkpoint paths and metrics. We can probably improve/change this later (since ray is notorious for deprecating and breaking APIs between releases).
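
A rough sketch of the relevant part of the change (illustrative only, not the actual diff; the trainable, experiment name and path are placeholders, assuming ray >= 2.7 where RunConfig lives under ray.train):

from ray import train, tune
from ray.train import RunConfig


def my_trainable(config):  # placeholder trainable
    train.report({"metric": 0.0})


tuner = tune.Tuner(
    my_trainable,
    run_config=RunConfig(
        name="autogluon_hpo",                # placeholder experiment name
        storage_path="/tmp/ag_ray_results",  # replaces the silently deprecated local_dir
    ),
)
# tuner.fit() would now write results under /tmp/ag_ray_results instead of ~/ray_results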

@ddelange
Contributor Author

ddelange commented Apr 1, 2024

Awesome finds! The breakages are a pain... hope the API converges after their next major release!

@AnirudhDagar
Contributor

@Innixma it would be great if we could get this in for v1.1 release somehow. I'll wait for the reviews, since it touches core, timeseries and multimodal.

@AnirudhDagar
Contributor

should also unblock #3920 (cc @suzhoum)


github-actions bot commented Apr 1, 2024

Job PR-3774-e666415 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3774/e666415/index.html

Collaborator

@shchur shchur left a comment

Thanks for looking into this! Looks good to me, feel free to merge after checking that the platform tests are passing

@Innixma Innixma added the priority: 0 Maximum priority label Apr 2, 2024
@Innixma Innixma mentioned this pull request Apr 2, 2024
@AnirudhDagar
Contributor

AnirudhDagar commented Apr 2, 2024

Contributor

@Innixma Innixma left a comment

LGTM!

We ran a benchmark and results look stable. Thanks everyone for helping to push this version upgrade!

@Innixma Innixma merged commit 273aaf9 into autogluon:master Apr 4, 2024
@AnirudhDagar
Contributor

Thanks all, especially @ddelange for persistently pushing this PR over such a long time. 🙌🏼

@ddelange
Contributor Author

ddelange commented Apr 5, 2024

I think tests missed a deprecation somehow:

Warning: Exception caused NeuralNetTorch_BAG_L1 to fail during hyperparameter tuning... Skipping this model.
Traceback (most recent call last):
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/trainer/abstract_trainer.py", line 2159, in _train_single_full
    hpo_models, hpo_results = model.hyperparameter_tune(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/models/abstract/abstract_model.py", line 1474, in hyperparameter_tune
    return self._hyperparameter_tune(hpo_executor=hpo_executor, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/models/ensemble/stacker_ensemble_model.py", line 184, in _hyperparameter_tune
    return super()._hyperparameter_tune(X=X, y=y, k_fold=k_fold, hpo_executor=hpo_executor, preprocess_kwargs=preprocess_kwargs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/models/ensemble/bagged_ensemble_model.py", line 1384, in _hyperparameter_tune
    hpo_executor.execute(
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/hpo/executors.py", line 408, in execute
    analysis = run(
               ^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/hpo/ray_hpo.py", line 276, in run
    results = tuner.fit()
              ^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/tuner.py", line 379, in fit
    return self._local_tuner.fit()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 477, in fit
    analysis = self._fit_internal(trainable, param_space)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 596, in _fit_internal
    analysis = run(
               ^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/tune.py", line 701, in run
    raise DeprecationWarning(
DeprecationWarning: ('`chdir_to_trial_dir` is deprecated. Use the RAY_CHDIR_TO_TRIAL_DIR environment variable instead. Set it to 0 to disable the default behavior of changing the working directory.', <class 'DeprecationWarning'>)
('`chdir_to_trial_dir` is deprecated. Use the RAY_CHDIR_TO_TRIAL_DIR environment variable instead. Set it to 0 to disable the default behavior of changing the working directory.', <class 'DeprecationWarning'>)

new API here: ray-project/ray@8b337ad
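
For reference, the replacement named in the warning is just an environment variable (sketch, based on the warning text above):

import os

# set before ray/tune starts; "0" disables changing the working directory per trial
os.environ["RAY_CHDIR_TO_TRIAL_DIR"] = "0"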

@AnirudhDagar
Contributor

@ddelange could you share an MRE for the above error/warning? I'm curious whether this deprecation warning results in an error (i.e. a hard deprecation that is raised) or is just a deprecation warning emitted and logged to the console during hyperparameter tuning. In either case we'd have to fix it, but the latter would be easier.

@ddelange
Contributor Author

ddelange commented Apr 5, 2024

like a hard deprecation warning which is raised

They raise it, so it short-circuits the training when installing from latest master (and thus installing ray 2.10.0).

@ddelange
Contributor Author

ddelange commented Apr 5, 2024

for some reason, the kwarg is not being passed in the timeseries module 🤔

@AnirudhDagar
Contributor

@ddelange can you share the reproduction script? That would be really helpful, thanks!

@ddelange
Contributor Author

ddelange commented Apr 10, 2024

here you go (notably with hyperparameter_tune_kwargs='auto'):

# install current latest commit on master
# pip install -U 'autogluon.core @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=core' 'autogluon.common @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=common' 'autogluon.features @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=features' 'autogluon.tabular[tests,fastai] @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=tabular'

from autogluon.tabular import TabularPredictor
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 2, size=(100, 100)))
df.columns = [str(col) for col in df.columns]

TabularPredictor(
    verbosity=3, 
    label='0', 
    path='/tmp/ag',
).fit(
    df, 
    time_limit=100, 
    hyperparameters={"NN_TORCH": {}}, 
    hyperparameter_tune_kwargs='auto',
)

@AnirudhDagar
Contributor

Thanks @ddelange. I spent some time playing around and diving deep into the error, and I can confirm the issue. This is indeed a major bug; I'll open a new issue to track it and send a fix ASAP.

Thanks again for smoke-testing the ray upgrade, your contributions are really valuable!

LennartPurucker pushed a commit to LennartPurucker/autogluon that referenced this pull request Jun 1, 2024
Co-authored-by: Anirudh Dagar <anirudhdagar6@gmail.com>