
Conversation

ddelange
Contributor

@ddelange ddelange commented Nov 28, 2023

Issue #, if available: #2687 (comment)

Description of changes: Allow ray 2.7, which includes critical fixes for Python 3.11

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Fix #3807, fix #3920

@Innixma Innixma added this to the 1.1 Release milestone Nov 28, 2023
@Innixma Innixma requested a review from AnirudhDagar November 28, 2023 17:23
@Innixma Innixma added the dependency Related to dependency packages label Nov 28, 2023
Contributor

@AnirudhDagar AnirudhDagar left a comment

Hi @ddelange, thanks for submitting the PR. ray 2.7 introduced some hard deprecations; the failing tests here are due to deprecation warnings triggered by the trainable taking a checkpoint_dir argument. I'm currently working on a fix that will make the trainable function take only one argument, since ray internally loosely checks for the extra checkpoint_dir arg in _detect_checkpoint_function, which then leads to the deprecation warning being raised here. Feel free to share any info you have on this.

@ddelange
Contributor Author

Hi @AnirudhDagar 👋 glad to hear you're already digging in! I hope they start being a bit stricter about their semantic versioning when 3.0 comes out 😅 They're doing quite a bit of work to converge on a more stable API; e.g. ray-project/ray@f22259b also shipped in ray 2.7.

@AnirudhDagar
Contributor

@ddelange After some more digging and reproducing the error in a small, self-contained code snippet, I raised ray-project/ray#41562, which might help us solve the issue. The ray deprecation warning about checkpoint_dir that causes the CI failure is, in my opinion, raised incorrectly; if that's right, this will require an upstream fix before we can update the versions here. After that, we will need to update the signature of hpo_trial and implement the new way of passing checkpoint_dir via ray (which is already mentioned in their deprecation warning).

def hpo_trial(sampled_hyperparameters, learner, checkpoint_dir=None, **_fit_args):

I'll do this after the issue has been verified on the ray side. Will keep you posted, thanks for your patience!
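
For reference, a rough sketch of what that migration could look like (illustrative only, not the actual change; it assumes the ray >= 2.7 ray.train checkpoint API, and the body and metric names are placeholders):

from ray import train


def hpo_trial(sampled_hyperparameters, learner, **_fit_args):  # no checkpoint_dir parameter
    checkpoint = train.get_checkpoint()  # None on the first run of a trial
    if checkpoint is not None:
        with checkpoint.as_directory() as checkpoint_dir:
            pass  # restore trial state from checkpoint_dir here

    # ... run the actual training here ...

    # report metrics (and optionally a new checkpoint) back to ray Tune
    train.report({"validation_performance": 0.0})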

@ddelange
Contributor Author

ddelange commented Dec 6, 2023

fwiw, ray 2.9 will also support pydantic v2 ref ray-project/ray#40451

@ddelange
Contributor Author

@AnirudhDagar how about the autogluon code temporarily ignoring the warning, with a reference to that issue?

@AnirudhDagar
Contributor

AnirudhDagar commented Dec 11, 2023

Thanks for the suggestion @ddelange. I'd only be comfortable ignoring deprecation warnings in tests once someone at ray can validate the issue and we can expect a fix soon. Otherwise we end up bumping the dependency and testing autogluon against the latest version of ray, all tests pass, but at release time users run into a deprecation warning they did not cause. That is not the user experience we want.

Also, I suspect that at some point ray will turn this into a hard error, and then simply ignoring the deprecation warning won't cut it.

@ddelange
Contributor Author

If I understood the reply correctly, this specific DeprecationWarning can safely be ignored in this case (also at runtime), and will even be removed in a future ray version 👍

@ddelange ddelange changed the title Allow ray 2.7 Loosen ray requirement Dec 21, 2023
@AnirudhDagar
Contributor

Hi @ddelange, sorry for the delayed response, I was on vacation. Yes, you are right: given the latest info shared by the ray team, that the warning will be removed in 2.10 anyway, we can support all these versions by temporarily suppressing those warnings. As a TODO, we should add a note in the code that the warning-suppression code needs to be removed later.

@ddelange
Contributor Author

ddelange commented Jan 5, 2024

welcome back! do you feel like hijacking my PR?

@AnirudhDagar
Contributor

No worries, if you're busy with other things, I can help support your PR with the rest of the required changes. Let me know, thanks!

@ddelange
Contributor Author

ddelange commented Jan 5, 2024

Added a warning filter using the message kwarg to regex-match checkpoint_dir, ref https://docs.python.org/3/library/warnings.html#describing-warning-filters
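
A minimal sketch of that kind of filter (illustrative; the exact regex used in the PR may differ):

import warnings

# `message` is a regex matched against the start of the warning text, so this
# suppresses only the checkpoint_dir deprecation warning, not all DeprecationWarnings
warnings.filterwarnings(
    "ignore",
    message=".*checkpoint_dir.*",  # assumed pattern; the PR's actual regex may differ
    category=DeprecationWarning,
)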

@ddelange
Contributor Author

Hi @AnirudhDagar 👋 can you run CI with elevated permissions?

@AnirudhDagar AnirudhDagar added the run-multi-gpu Run multimodal multi-gpu tests label Jan 22, 2024
@ddelange
Contributor Author

ddelange commented Jan 24, 2024

===End Flaky Test Report=== in https://github.com/autogluon/autogluon/actions/runs/7612795659/job/20731334764?pr=3774#step:4:3062

Is it indeed flaky, or can you spot a ray-related error?

edit:

[2024-01-22T15:10:05.633Z] Traceback (most recent call last):
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
[2024-01-22T15:10:05.633Z]     result = ray.get(future)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
[2024-01-22T15:10:05.633Z]     return fn(*args, **kwargs)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
[2024-01-22T15:10:05.633Z]     return func(*args, **kwargs)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
[2024-01-22T15:10:05.633Z]     raise value.as_instanceof_cause()
[2024-01-22T15:10:05.633Z] ray.exceptions.RayTaskError(UnboundLocalError): ray::ImplicitFunc.train() (pid=4194, ip=10.0.1.20, actor_id=92d1d7abbb87715c7cb2334401000000, repr=model_trial)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 342, in train
[2024-01-22T15:10:05.633Z]     raise skipped from exception_cause(skipped)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/air/_internal/util.py", line 88, in run
[2024-01-22T15:10:05.633Z]     self._ret = self._target(*self._args, **self._kwargs)
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 115, in <lambda>
[2024-01-22T15:10:05.633Z]     training_func=lambda: self._trainable_func(self.config),
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 332, in _trainable_func
[2024-01-22T15:10:05.633Z]     output = fn()
[2024-01-22T15:10:05.633Z]   File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/util.py", line 138, in inner
[2024-01-22T15:10:05.633Z]     return trainable(config, **fn_kwargs)
[2024-01-22T15:10:05.633Z]   File "/home/ci/autogluon/timeseries/src/autogluon/timeseries/models/abstract/model_trial.py", line 52, in model_trial
[2024-01-22T15:10:05.633Z]     logger.error(f"\tWarning: Exception caused {model.name} to fail during training... Skipping this model.")
[2024-01-22T15:10:05.633Z] UnboundLocalError: local variable 'model' referenced before assignment
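
(Side note: a minimal illustration of the pattern behind that UnboundLocalError, not autogluon's actual code.)

# Hypothetical illustration: if init_model() raises, `model` is never bound, so
# referencing model.name inside the handler raises UnboundLocalError instead of
# logging the original failure.
def model_trial(config, init_model):
    try:
        model = init_model(config)
        model.fit()
    except Exception:
        print(f"Warning: Exception caused {model.name} to fail during training... Skipping this model.")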

@AnirudhDagar
Contributor

AnirudhDagar commented Jan 24, 2024

No, we still have an issue with ray. You can look at the other workflows and see that the ignore doesn't work. The problem lies in the way the warning is raised as an exception rather than via warnings.warn.

You can try the following: even after adding a warning ignore filter, the deprecation warning will still come up because it's raised as an exception. Sharing an MRE below; pip install ray==2.9.1 before running the snippet in IPython.

import warnings
from ray import tune

# try ignoring all DeprecationWarnings
warnings.simplefilter("ignore", category=DeprecationWarning)

trainable_kwargs = {'kwarg1': 1, 'kwarg2': 2}
def train_fn(config, **trainable_kwargs):
    pass

# Creating the tuner still surfaces the DeprecationWarning despite the filter above
tuner = tune.Tuner(tune.with_parameters(train_fn, **trainable_kwargs))
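
To spell out the distinction (a small illustrative snippet, not ray's code): warnings filters only apply to warnings emitted through warnings.warn; a warning class that is raised propagates as an ordinary exception regardless of any filter.

import warnings

warnings.simplefilter("ignore", category=DeprecationWarning)

warnings.warn("soft deprecation", DeprecationWarning)  # silenced by the filter
raise DeprecationWarning("hard deprecation")           # still raises; the filter is irrelevant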

@AnirudhDagar
Contributor

===End Flaky Test Report=== in https://github.com/autogluon/autogluon/actions/runs/7612795659/job/20731334764?pr=3774#step:4:3062
Is it indeed flaky, or can you spot a ray-related error?

@ddelange I'll investigate what's going on there, and let you know.

@Innixma Innixma modified the milestones: 1.1 Release, 1.0.1 Release Jan 25, 2024
@yinweisu
Contributor

yinweisu commented Apr 1, 2024

Previous CI Run Current CI Run

@AnirudhDagar
Contributor

AnirudhDagar commented Apr 1, 2024

@ddelange Hopefully my latest commits will fix the failing tests. It turns out ray also silently deprecated the local_dir arg in ray-project/ray#33463 (see https://github.com/ray-project/ray/blob/master/python/ray/air/config.py#L656-L657) and replaced it with storage_path. We were still passing the local_dir arg, which no longer had any effect, so ray was falling back to its default save path ~/ray_results.

Note that we are using a private method, _get_trial_checkpoints_with_metric (the most direct replacement for the deprecated function, and I'd like to keep the changes minimal), to get the best trial checkpoint paths and metrics. We can probably improve/change this later (since ray is notorious for deprecating and breaking APIs between releases).
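
A rough sketch of the relevant part of the change (illustrative only, not the actual diff; the trainable, experiment name and path are placeholders, assuming ray >= 2.7 where RunConfig lives under ray.train):

from ray import train, tune
from ray.train import RunConfig


def my_trainable(config):  # placeholder trainable
    train.report({"metric": 0.0})


tuner = tune.Tuner(
    my_trainable,
    run_config=RunConfig(
        name="autogluon_hpo",                # placeholder experiment name
        storage_path="/tmp/ag_ray_results",  # replaces the silently deprecated local_dir
    ),
)
# tuner.fit() would now write results under /tmp/ag_ray_results instead of ~/ray_results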

@ddelange
Contributor Author

ddelange commented Apr 1, 2024

Awesome finds! The breakages are a pain... hope the API converges after their next major release!

@AnirudhDagar
Contributor

@Innixma it would be great if we could get this in for v1.1 release somehow. I'll wait for the reviews, since it touches core, timeseries and multimodal.

@AnirudhDagar
Contributor

should also unblock #3920 (cc @suzhoum)


github-actions bot commented Apr 1, 2024

Job PR-3774-e666415 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3774/e666415/index.html

Collaborator

@shchur shchur left a comment

Thanks for looking into this! Looks good to me, feel free to merge after checking that the platform tests are passing

@Innixma Innixma added the priority: 0 Maximum priority label Apr 2, 2024
@Innixma Innixma mentioned this pull request Apr 2, 2024
@AnirudhDagar
Contributor

AnirudhDagar commented Apr 2, 2024

Contributor

@Innixma Innixma left a comment

LGTM!

We ran a benchmark and results look stable. Thanks everyone for helping to push this version upgrade!

@Innixma Innixma merged commit 273aaf9 into autogluon:master Apr 4, 2024
@AnirudhDagar
Contributor

Thanks all, especially @ddelange for persistently pushing this PR over such a long time. 🙌🏼

@ddelange
Contributor Author

ddelange commented Apr 5, 2024

I think tests missed a deprecation somehow:

Warning: Exception caused NeuralNetTorch_BAG_L1 to fail during hyperparameter tuning... Skipping this model.
Traceback (most recent call last):
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/trainer/abstract_trainer.py", line 2159, in _train_single_full
    hpo_models, hpo_results = model.hyperparameter_tune(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/models/abstract/abstract_model.py", line 1474, in hyperparameter_tune
    return self._hyperparameter_tune(hpo_executor=hpo_executor, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/models/ensemble/stacker_ensemble_model.py", line 184, in _hyperparameter_tune
    return super()._hyperparameter_tune(X=X, y=y, k_fold=k_fold, hpo_executor=hpo_executor, preprocess_kwargs=preprocess_kwargs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/models/ensemble/bagged_ensemble_model.py", line 1384, in _hyperparameter_tune
    hpo_executor.execute(
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/hpo/executors.py", line 408, in execute
    analysis = run(
               ^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/hpo/ray_hpo.py", line 276, in run
    results = tuner.fit()
              ^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/tuner.py", line 379, in fit
    return self._local_tuner.fit()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 477, in fit
    analysis = self._fit_internal(trainable, param_space)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 596, in _fit_internal
    analysis = run(
               ^^^^
  File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/tune.py", line 701, in run
    raise DeprecationWarning(
DeprecationWarning: ('`chdir_to_trial_dir` is deprecated. Use the RAY_CHDIR_TO_TRIAL_DIR environment variable instead. Set it to 0 to disable the default behavior of changing the working directory.', <class 'DeprecationWarning'>)
('`chdir_to_trial_dir` is deprecated. Use the RAY_CHDIR_TO_TRIAL_DIR environment variable instead. Set it to 0 to disable the default behavior of changing the working directory.', <class 'DeprecationWarning'>)

new API here: ray-project/ray@8b337ad
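
For reference, the replacement named in the warning is just an environment variable (sketch, based on the warning text above):

import os

# set before ray/tune starts; "0" disables changing the working directory per trial
os.environ["RAY_CHDIR_TO_TRIAL_DIR"] = "0"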

@AnirudhDagar
Contributor

@ddelange could you share an MRE for the above error/warning? I'm curious whether this deprecation warning results in an error (i.e. a hard deprecation that is raised) or is just a deprecation warning emitted and logged to the console during hyperparameter tuning. In either case we'd have to fix it, but the latter would be easier.

@ddelange
Contributor Author

ddelange commented Apr 5, 2024

like a hard deprecation warning which is raised

They raise it, so it short-circuits the training when installing from latest master (and thus installing ray 2.10.0).

@ddelange
Contributor Author

ddelange commented Apr 5, 2024

for some reason, the kwarg is not being passed in the timeseries module 🤔

@AnirudhDagar
Contributor

@ddelange can you share the reproduction script? That would be really helpful, thanks!

@ddelange
Contributor Author

ddelange commented Apr 10, 2024

here you go (notably with hyperparameter_tune_kwargs='auto'):

# install current latest commit on master
# pip install -U 'autogluon.core @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=core' 'autogluon.common @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=common' 'autogluon.features @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=features' 'autogluon.tabular[tests,fastai] @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=tabular'

from autogluon.tabular import TabularPredictor
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 2, size=(100, 100)))
df.columns = [str(col) for col in df.columns]

TabularPredictor(
    verbosity=3, 
    label='0', 
    path='/tmp/ag',
).fit(
    df, 
    time_limit=100, 
    hyperparameters={"NN_TORCH": {}}, 
    hyperparameter_tune_kwargs='auto',
)

@AnirudhDagar
Contributor

Thanks @ddelange. I spent some time playing around and diving deep into the error, and I can confirm the issue. This is indeed a major bug; I'll open a new issue to track it and send a fix ASAP.

Thanks again for smoke-testing the ray upgrade, your contributions are really valuable!

LennartPurucker pushed a commit to LennartPurucker/autogluon that referenced this pull request Jun 1, 2024
Co-authored-by: Anirudh Dagar <anirudhdagar6@gmail.com>