Bump ray to 2.10 #3774
Conversation
Hi @ddelange, thanks for submitting the PR. ray 2.7 introduced some hard deprecations; the failed tests here are due to the deprecation warnings triggered by the trainer taking in the checkpoint_dir argument. I'm currently working on a fix that will make the trainable function take only one argument, since ray internally loosely checks for the extra checkpoint_dir arg in _detect_checkpoint_function, which then leads to the deprecation warning being raised. Feel free to share any info you have on this.
Hi @AnirudhDagar 👋 glad to hear you're already digging in! I hope they become a bit stricter about their semantic versioning when 3.0 comes out 😅 They're doing quite some work to converge on a more stable API; e.g. ray-project/ray@f22259b was also shipped in ray 2.7.
@ddelange After some more digging and reproducing the error in a small self-contained code snippet, I raised ray-project/ray#41562, which might help us solve the issue.
I'll do this after the issue has been verified on the ray side. Will keep you posted, thanks for your patience!
fwiw, ray 2.9 will also support pydantic v2 (ref: ray-project/ray#40451)
@AnirudhDagar how about having the autogluon code temporarily ignore the warning, with a reference to that issue?
Thanks for the suggestion @ddelange. I'd be comfortable ignoring deprecation warnings for tests only after someone at ray can validate the issue and confirm a fix is coming soon. Otherwise we end up bumping the dependency and testing autogluon with the latest version of ray, all tests pass, but at release time users run into a deprecation warning they did not cause. That is not the user experience we want. I also suspect that at some point ray will turn this into a hard error, and then simply ignoring the deprecation warning won't cut it.
If I understood his reply correctly, this specific DeprecationWarning can be safely ignored in this case, also at runtime, and will even be removed in a future ray version 👍
Hi @ddelange sorry for the delayed response, I was vacationing. Yes, you are right: with the latest info from the ray team that the warning will be removed in 2.10 anyway, we can support all the versions by temporarily suppressing those warnings, and add a TODO note in the code that the suppression needs to be removed later.
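A sketch of what such a temporary, scoped suppression could look like. The message pattern is an assumption and would need to match the exact text ray emits:

```python
import warnings

# TODO: remove this suppression once the warning is gone in ray >= 2.10.
# The message pattern below is an assumption; match it to the exact text
# ray emits for the `checkpoint_dir` deprecation.
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message=".*checkpoint_dir.*",
        category=DeprecationWarning,
    )
    # Anything emitted via warnings.warn matching the filter is silenced
    # inside this block; the previous filters are restored on exit.
    warnings.warn("`checkpoint_dir` is deprecated", DeprecationWarning)
```

Using `catch_warnings` keeps the filter local to the call site instead of mutating the process-wide filter list, which makes the eventual cleanup easier.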
welcome back! do you feel like hijacking my PR?
No worries, if you're busy with other things, I can help support your PR with the rest of the required changes. Let me know, thanks!
added a warning filter using the
Hi @AnirudhDagar 👋 can you run CI with elevated permissions?
is it indeed flaky, or can you spot a ray-related error?
edit:
[2024-01-22T15:10:05.633Z] Traceback (most recent call last):
[2024-01-22T15:10:05.633Z] File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
[2024-01-22T15:10:05.633Z] result = ray.get(future)
[2024-01-22T15:10:05.633Z] File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
[2024-01-22T15:10:05.633Z] return fn(*args, **kwargs)
[2024-01-22T15:10:05.633Z] File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
[2024-01-22T15:10:05.633Z] return func(*args, **kwargs)
[2024-01-22T15:10:05.633Z] File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
[2024-01-22T15:10:05.633Z] raise value.as_instanceof_cause()
[2024-01-22T15:10:05.633Z] ray.exceptions.RayTaskError(UnboundLocalError): ray::ImplicitFunc.train() (pid=4194, ip=10.0.1.20, actor_id=92d1d7abbb87715c7cb2334401000000, repr=model_trial)
[2024-01-22T15:10:05.633Z] File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 342, in train
[2024-01-22T15:10:05.633Z] raise skipped from exception_cause(skipped)
[2024-01-22T15:10:05.633Z] File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/air/_internal/util.py", line 88, in run
[2024-01-22T15:10:05.633Z] self._ret = self._target(*self._args, **self._kwargs)
[2024-01-22T15:10:05.633Z] File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 115, in <lambda>
[2024-01-22T15:10:05.633Z] training_func=lambda: self._trainable_func(self.config),
[2024-01-22T15:10:05.633Z] File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 332, in _trainable_func
[2024-01-22T15:10:05.633Z] output = fn()
[2024-01-22T15:10:05.633Z] File "/home/ci/opt/venv/lib/python3.10/site-packages/ray/tune/trainable/util.py", line 138, in inner
[2024-01-22T15:10:05.633Z] return trainable(config, **fn_kwargs)
[2024-01-22T15:10:05.633Z] File "/home/ci/autogluon/timeseries/src/autogluon/timeseries/models/abstract/model_trial.py", line 52, in model_trial
[2024-01-22T15:10:05.633Z] logger.error(f"\tWarning: Exception caused {model.name} to fail during training... Skipping this model.")
[2024-01-22T15:10:05.633Z] UnboundLocalError: local variable 'model' referenced before assignment
No, we still have an issue with ray. You can look at the other workflows and see the ignore doesn't work. The problem lies in the way the warning is raised as an exception rather than emitted through the warnings machinery: even after adding a warning ignore filter, the deprecation warning still comes up because it's an Exception. Sharing an MRE below; pip install ray==2.9.1 before running the snippet in ipython.
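The distinction can be demonstrated with the standard library alone. This sketch assumes nothing about ray itself; it only shows why a warnings filter cannot intercept a warning class that is `raise`d instead of passed to `warnings.warn`:

```python
import warnings

# A process-wide ignore filter for DeprecationWarning...
warnings.filterwarnings("ignore", category=DeprecationWarning)

# ...silences warnings emitted through the warnings machinery:
warnings.warn("soft deprecation", DeprecationWarning)  # nothing shown

# ...but has no effect on a warning *raised* as an exception, which the
# discussion above says ray does for hard deprecations; such a warning
# propagates like any other exception and must be caught explicitly.
try:
    raise DeprecationWarning("hard deprecation")
except DeprecationWarning as exc:
    print(f"caught anyway: {exc}")
```

This is why bumping the dependency, rather than filtering, was ultimately required here.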
@ddelange I'll investigate what's going on there, and let you know.
@ddelange Hopefully my latest commits will fix the failing tests. Turns out ray also silently deprecated this. Note that we are using a private method.
awesome finds! the breakages are a pain.. hope the API will converge after their next major release!
@Innixma it would be great if we could get this in for the v1.1 release somehow. I'll wait for the reviews, since it touches core, timeseries and multimodal.
Job PR-3774-e666415 is done.
Thanks for looking into this! Looks good to me, feel free to merge after checking that the platform tests are passing
Here are the platform tests for this:
LGTM!
We ran a benchmark and results look stable. Thanks everyone for helping to push this version upgrade!
Thanks all, especially @ddelange for pushing this PR constantly for a long time. 🙌🏼
I think the tests missed a deprecation somehow:
Warning: Exception caused NeuralNetTorch_BAG_L1 to fail during hyperparameter tuning... Skipping this model.
Traceback (most recent call last):
File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/trainer/abstract_trainer.py", line 2159, in _train_single_full
hpo_models, hpo_results = model.hyperparameter_tune(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/models/abstract/abstract_model.py", line 1474, in hyperparameter_tune
return self._hyperparameter_tune(hpo_executor=hpo_executor, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/models/ensemble/stacker_ensemble_model.py", line 184, in _hyperparameter_tune
return super()._hyperparameter_tune(X=X, y=y, k_fold=k_fold, hpo_executor=hpo_executor, preprocess_kwargs=preprocess_kwargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/models/ensemble/bagged_ensemble_model.py", line 1384, in _hyperparameter_tune
hpo_executor.execute(
File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/hpo/executors.py", line 408, in execute
analysis = run(
^^^^
File "/usr/share/python3/app/lib/python3.11/site-packages/autogluon/core/hpo/ray_hpo.py", line 276, in run
results = tuner.fit()
^^^^^^^^^^^
File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/tuner.py", line 379, in fit
return self._local_tuner.fit()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 477, in fit
analysis = self._fit_internal(trainable, param_space)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 596, in _fit_internal
analysis = run(
^^^^
File "/usr/share/python3/app/lib/python3.11/site-packages/ray/tune/tune.py", line 701, in run
raise DeprecationWarning(
DeprecationWarning: ('`chdir_to_trial_dir` is deprecated. Use the RAY_CHDIR_TO_TRIAL_DIR environment variable instead. Set it to 0 to disable the default behavior of changing the working directory.', <class 'DeprecationWarning'>)
('`chdir_to_trial_dir` is deprecated. Use the RAY_CHDIR_TO_TRIAL_DIR environment variable instead. Set it to 0 to disable the default behavior of changing the working directory.', <class 'DeprecationWarning'>)
new API here: ray-project/ray@8b337ad
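Per the deprecation message quoted above, the kwarg is replaced by an environment variable. A minimal sketch of the replacement (the variable name comes straight from the message; how autogluon wires it in is not shown here):

```python
import os

# ray 2.10 replaces the `chdir_to_trial_dir` kwarg with the
# RAY_CHDIR_TO_TRIAL_DIR environment variable. "0" disables the default
# behavior of chdir-ing into the trial directory. It must be set before
# the ray workers are started for them to pick it up.
os.environ["RAY_CHDIR_TO_TRIAL_DIR"] = "0"

print(os.environ["RAY_CHDIR_TO_TRIAL_DIR"])
```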
@ddelange could you share an MRE for the above error/warning? I'm curious whether this deprecation warning results in an error (like a hard deprecation that is raised) or is just a warning printed and logged to the console during hparam tuning. In either case we'd have to fix it, but the latter might be easier.
they raise it, and so it short-circuits the training when installing from latest master (and thus installing ray 2.10.0)
for some reason, the kwarg is not being passed in the timeseries module 🤔
autogluon/timeseries/src/autogluon/timeseries/models/abstract/abstract_timeseries_model.py (line 484 in 3cec54a)
@ddelange can you share the reproduction script? That would be really helpful, thanks!
here you go:
# install current latest commit on master
# pip install -U 'autogluon.core @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=core' 'autogluon.common @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=common' 'autogluon.features @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=features' 'autogluon.tabular[tests,fastai] @ https://github.com/autogluon/autogluon/archive/refs/heads/a48be1cda74479761161354ec7b360f9eeb7314f.zip#subdirectory=tabular'
from autogluon.tabular import TabularPredictor
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 2, size=(100, 100)))
df.columns = [str(col) for col in df.columns]
TabularPredictor(
verbosity=3,
label='0',
path='/tmp/ag',
).fit(
df,
time_limit=100,
hyperparameters={"NN_TORCH": {}},
hyperparameter_tune_kwargs='auto',
)
Thanks @ddelange, I spent some time playing around and diving deep into the error, and I can confirm the issue. This is indeed a major bug; I'll open a new issue to track it and send a fix ASAP. Thanks again for smoke-testing the ray upgrade, your contributions are really valuable!
Co-authored-by: Anirudh Dagar <anirudhdagar6@gmail.com>
Issue #, if available: #2687 (comment)
Description of changes: Allow ray 2.7, which includes critical fixes for py 3.11
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Fix #3807, fix #3920