Description
Bug Report Checklist
- I provided code that demonstrates a minimal reproducible example.
- I confirmed bug exists on the latest mainline of AutoGluon via source install.
- I confirmed bug exists on the latest stable version of AutoGluon.
Describe the bug
After training a TimeSeriesPredictor, we copy the resulting model directory into blob storage and then download it from blob storage into a local directory at prediction time. All the non-tabular models work well, but when the best model is RecursiveTabular or DirectTabular, prediction fails because the TimeSeriesPredictor is unable to find the model's pkl file:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/autogluon/timeseries/trainer/abstract_trainer.py", line 944, in get_model_pred_dict
    model_pred_dict[model_name] = self._predict_model(
  File "/usr/local/lib/python3.10/site-packages/autogluon/timeseries/trainer/abstract_trainer.py", line 874, in _predict_model
    return model.predict(data, known_covariates=known_covariates)
  File "/usr/local/lib/python3.10/site-packages/autogluon/timeseries/models/abstract/abstract_timeseries_model.py", line 298, in predict
    predictions = self._predict(data=data, known_covariates=known_covariates, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/autogluon/timeseries/models/multi_window/multi_window_model.py", line 177, in _predict
    return self.most_recent_model.predict(data, known_covariates=known_covariates, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/autogluon/timeseries/models/abstract/abstract_timeseries_model.py", line 298, in predict
    predictions = self._predict(data=data, known_covariates=known_covariates, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/autogluon/timeseries/models/autogluon_tabular/mlforecast.py", line 467, in _predict
    raw_predictions = self._mlf.models_["mean"].predict(df)
  File "/usr/local/lib/python3.10/site-packages/autogluon/timeseries/models/autogluon_tabular/mlforecast.py", line 55, in predict
    return self.predictor.predict(X).values
  File "/usr/local/lib/python3.10/site-packages/autogluon/tabular/predictor/predictor.py", line 1931, in predict
    return self._learner.predict(X=data, model=model, as_pandas=as_pandas, transform_features=transform_features, decision_threshold=decision_threshold)
  File "/usr/local/lib/python3.10/site-packages/autogluon/tabular/learner/abstract_learner.py", line 208, in predict
    y_pred_proba = self.predict_proba(
  File "/usr/local/lib/python3.10/site-packages/autogluon/tabular/learner/abstract_learner.py", line 189, in predict_proba
    y_pred_proba = self.load_trainer().predict_proba(X, model=model)
  File "/usr/local/lib/python3.10/site-packages/autogluon/core/trainer/abstract_trainer.py", line 773, in predict_proba
    return self._predict_proba_model(X, model, cascade=cascade)
  File "/usr/local/lib/python3.10/site-packages/autogluon/core/trainer/abstract_trainer.py", line 2525, in _predict_proba_model
    return self.get_pred_proba_from_model(model=model, X=X, model_pred_proba_dict=model_pred_proba_dict, cascade=cascade)
  File "/usr/local/lib/python3.10/site-packages/autogluon/core/trainer/abstract_trainer.py", line 787, in get_pred_proba_from_model
    model_pred_proba_dict = self.get_model_pred_proba_dict(X=X, models=models, model_pred_proba_dict=model_pred_proba_dict, cascade=cascade)
  File "/usr/local/lib/python3.10/site-packages/autogluon/core/trainer/abstract_trainer.py", line 1022, in get_model_pred_proba_dict
    model = self.load_model(model_name=model_name)
  File "/usr/local/lib/python3.10/site-packages/autogluon/core/trainer/abstract_trainer.py", line 1651, in load_model
    return model_type.load(path=os.path.join(self.path, path), reset_paths=self.reset_paths)
  File "/usr/local/lib/python3.10/site-packages/autogluon/core/models/abstract/abstract_model.py", line 1096, in load
    model = load_pkl.load(path=file_path, verbose=verbose)
  File "/usr/local/lib/python3.10/site-packages/autogluon/common/loaders/load_pkl.py", line 43, in load
    with compression_fn_map[compression_fn]["open"](validated_path, "rb", **compression_fn_kwargs) as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp4katc9my/models/DirectTabular/W2/tabular_predictor/models/LightGBM/model.pkl'
The temporary path it is trying to load from is the path where the model was trained, on a different cloud instance. We persist these models by uploading the full model directory, downloading it into a new path, and then calling TimeSeriesPredictor.load(new_path).
Steps to reproduce
- Train a TimeSeriesPredictor with DirectTabular enabled
- Move the model directory to a different location, ensuring the original path no longer exists
- Load from the new directory
- Predict using DirectTabular (or the WeightedEnsemble)
The Predictor will try to load the model from the old directory.
This issue only arises with the Tabular models.
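The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact code from our pipeline: the synthetic data, the `move_model_dir` helper, and the `reproduce` function are hypothetical names introduced for illustration, and a local `shutil.move` stands in for the upload to and download from blob storage.

```python
import shutil
import tempfile
from pathlib import Path


def move_model_dir(src: str, dst: str) -> str:
    """Move a saved predictor directory so that the original path no longer exists."""
    dst_path = Path(dst)
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(src, str(dst_path))  # dst must not exist yet, so src is renamed to dst
    return str(dst_path)


def reproduce() -> None:
    # Imported lazily so the helper above can be used without AutoGluon installed.
    import numpy as np
    import pandas as pd
    from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

    # Hypothetical synthetic data: two daily series with 200 points each.
    timestamps = pd.date_range("2023-01-01", periods=200, freq="D")
    frames = [
        pd.DataFrame({"item_id": item, "timestamp": timestamps, "target": np.random.rand(200)})
        for item in ["A", "B"]
    ]
    data = TimeSeriesDataFrame.from_data_frame(
        pd.concat(frames), id_column="item_id", timestamp_column="timestamp"
    )

    # Step 1: train with DirectTabular enabled.
    old_path = tempfile.mkdtemp()
    predictor = TimeSeriesPredictor(prediction_length=7, path=old_path)
    predictor.fit(data, hyperparameters={"DirectTabular": {}})

    # Step 2: move the directory; old_path no longer exists afterwards.
    new_path = move_model_dir(old_path, str(Path(tempfile.mkdtemp()) / "moved_predictor"))

    # Steps 3 and 4: load from the new location and predict.
    loaded = TimeSeriesPredictor.load(new_path)
    loaded.predict(data)  # in our setup (AutoGluon 1.1.0) this raises FileNotFoundError


if __name__ == "__main__":
    reproduce()
```

Swapping `"DirectTabular"` for `"RecursiveTabular"` in `hyperparameters` should exercise the other affected model.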
Installed Versions
INSTALLED VERSIONS
------------------
date : 2024-04-24
time : 12:29:19.525968
python : 3.10.13.final.0
OS : Linux
OS-release : 5.15.0-1052-azure
Version : #60-Ubuntu SMP Mon Nov 6 10:08:16 UTC 2023
machine : x86_64
processor : x86_64
num_cores : 32
cpu_ram_mb : 257926.4453125
cuda version : None
num_gpus : 0
gpu_ram_mb : []
avail_disk_size_mb : 33082
accelerate : 0.21.0
autogluon : None
autogluon.common : 1.1.0
autogluon.core : 1.1.0
autogluon.features : 1.1.0
autogluon.tabular : 1.1.0
autogluon.timeseries: 1.1.0
boto3 : 1.34.90
catboost : 1.2.5
fastai : None
gluonts : 0.14.3
hyperopt : 0.2.7
imodels : None
joblib : 1.4.0
lightgbm : 3.3.5
lightning : 2.1.4
matplotlib : 3.8.4
mlforecast : 0.10.0
networkx : 3.3
numpy : 1.25.2
onnxruntime-gpu : None
optimum : None
optimum-intel : None
orjson : 3.10.1
pandas : 2.0.3
psutil : 5.9.8
pytorch-lightning : 2.1.4
ray : 2.10.0
requests : 2.31.0
scikit-learn : 1.4.0
scikit-learn-intelex: None
scipy : 1.12.0
setuptools : 69.5.1
skl2onnx : None
statsforecast : 1.4.0
tabpfn : None
tensorboard : 2.16.2
torch : 2.1.2
tqdm : 4.66.2
transformers : 4.38.2
utilsforecast : 0.0.10
vowpalwabbit : None
xgboost : 1.7.6
Somewhere there is an absolute path being used instead of a relative path. Have you seen this issue before?
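To narrow down where the absolute path is stored, one option is to scan the downloaded model directory for pickle files whose raw bytes still contain the old training prefix (here `/tmp/tmp4katc9my`, taken from the traceback). The helper below is a rough diagnostic sketch using only the standard library. `find_files_with_stale_path` is a hypothetical name, not an AutoGluon function, and it assumes the pickles are uncompressed so that string attributes appear literally in the file.

```python
import os
from typing import List


def find_files_with_stale_path(model_dir: str, old_prefix: str) -> List[str]:
    """Return .pkl files (relative to model_dir) whose raw bytes contain old_prefix."""
    needle = old_prefix.encode("utf-8")
    hits = []
    for root, _dirs, files in os.walk(model_dir):
        for name in files:
            if not name.endswith(".pkl"):
                continue
            file_path = os.path.join(root, name)
            # Read the raw bytes rather than unpickling, so no model code is executed.
            with open(file_path, "rb") as fin:
                if needle in fin.read():
                    hits.append(os.path.relpath(file_path, model_dir))
    return sorted(hits)
```

For example, `find_files_with_stale_path(new_path, "/tmp/tmp4katc9my")` would list the files that still reference the training location. A compressed pickle would not be matched by this byte search.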
I see that TabularPredictor has some clone-for-deployment methods, and that autogluon.cloud has functionality to persist models to S3. Is there a recommended way to do this ourselves using Azure Blob Storage?
Thank you