-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[ENH] Save/Load Functionality for both Classical and DL Estimators #3425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Following is a small code-snippet to showcase how "states are preserved" / "contract is maintained": import sktime.classification.deep_learning as dl
from sktime.base import load, load_dl
import numpy as np
from pprint import pprint
from pathlib import Path
if __name__ == "__main__":
save_folder_location = "temp_saved"
train_X = np.random.randn(15, 24, 16)
train_y = np.random.randint(0, 2, size=(15, ))
test_X = np.random.randn(5, 24, 16)
cnn = dl.cnn.CNNClassifier(n_epochs=1)
cnn.fit(train_X, train_y)
cnn.save(save_folder_location)
new_cnn = load_dl(save_folder_location)
pprint(cnn.__dict__)
pprint(cnn.model_.optimizer)
pprint(cnn.optimizer)
pprint(cnn.optimizer_)
pprint(cnn.history)
print("--" * 50)
pprint(new_cnn.__dict__)
pprint(new_cnn.model_.optimizer)
pprint(new_cnn.optimizer)
pprint(new_cnn.optimizer_)
pprint(new_cnn.history)
print("Does both give same result to same data?", np.allclose(
cnn.predict(X=test_X),
new_cnn.predict(X=test_X)
)) The output:
@AurumnPegasus I would also like for you to code-review this PR and give some suggestion for possible tests. |
Hey @achieveordie , I went through your PR and I think the code is good. I agree with the About the tests, there were a few I had in mind:
|
Thanks for the review! My only problem with a separate As for tests for the current implementation, I agree with your points. However, the idempotent test seems to be testing for fitting model twice as opposed to save/load. I might be wrong since I just glanced at it. For the second point, I was wondering if there is already an implementation that I could look at in the existing tests? Edit: Alternatively, I think you might be referring to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea looks good, but we need to get to a state where there is only a single load
function. Why: think of the workflow, you are breaking the strategy pattern. This would introduce a case distinction of selecting the right "load" file, this does not scale well and gets confusing. Knowledge of what estimator is in the file/directory should not be assumed.
All specific functionality for loading objects should be localized in an override of one of the load_
methods of the class. The idea is that load
is universal, and any specific functionality sits in the class. This is achieved by a file format that wraps the class and a custom store.
This looks quite easy to do, and should not require user to know anything about either the estimator or the saved file/directory. Maybe the python command |
This is the same idea that I was pursuing, I was just attatching these as overrides to the class they are relevant for. This is following the strategy oop pattern, with one of the goals to localize the extension burden to the class. Otherwise, if you add new estimators with specific loading, you have two loci of extension - the So, yes, but
|
If we add new estimators that require a different logic than already provided in base classes(having custom objects for example), then the author would have to expand the class method |
Hm, can you explain why you added more code in My concern is that subject to your design, the same might be necessary if, say, we add a Though I admit I have perhaps not fully understood how you see the extension contract, so I'm happy to be convinced that this is the last time something has to be added to |
I've updated the STEP to clarify what I had in mind: |
Perhaps we don't understand each other. The additional code is due to the fact that non-DL and DL estimators are being stored in different ways. For the sake of discussion, if we were to save non-DL estimators in a directory (not zip them), then we can get rid of Going from zip to directory-based for non-DL estimators seems easy, whereas going from directory to file-based storage (for DL estimators) needs more exploration. It might be easy/difficult/impossible. We need to discuss this further. |
Yes, I suppose so.
I'm not saying "there should not be additional/different code for DL estimators". I'm saying "there will likely be different/additional code, and it should live in the base class of the estimator, not in the
Yes, let's chat tomorrow when we are meeting anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, nice!
I think this solves the extensibility issue neatly.
Didn't know shutil
until now.
Will merge this post 0.13.3 so we have some time for testing, and since it would make it 0.14.0 (major interface addition).
question: is there a way to avoid the creation of the temp folder? |
Wouldn't this be a problem with |
I was just thinking to minimize file write/delete actions, as heuristic detection may find more steps more odd, but that's just a gut feeling. |
After taking care of the issue with
If you think that either of them is in the scope of this PR, LMK, otherwise one can skip these tests and work on it in another PR. |
@fkiraly It seems that the only errors remaining are the Do leave some comments about the added tests for base deep classifiers, these tests were meant to check if |
Thanks! looks great! Did you check (by deactivating |
from zipfile import ZipFile | ||
|
||
if path is None: | ||
import h5py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to make sure we have proper isolation of soft dependencies.
This would simply break at use and test.
For use, I would put a check and an error message, i.e., "in-memory serialization requires h5py
" or similar.
For the testing, we need to think about what to skip when.
The serialization test (save/load to memory) will fail without h5py
. Since it is part of the standard test suite, this creates a problem for which I can't see a nice solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(the general requirement being, the full test suite of sktime should run without failures with any set of soft dependencies installed, the problematic case would arise if you have tensorflow
installed but not h5py
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tensorflow
is also a part of the soft dependencies, right? Or we do not consider it since it's also included in dl
package?
These look like genuine bugs to me, i.e., |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summarizing remaining items afaik:
- we need to deal somehow with the issue that test suite should not break for a user who has
tensorflow
but noth5py
installed, and it still should have maximal possible coverage - handling
fit_test_idempotent
AssertionError
(see above) - answer to "do we really need a special save/load now" below
Tests look good to me! Stupid question: now that we got set/getstate to work, and therefore |
If you remember in one of the conversations I mentioned that my system fails to run the entire suite. It just hangs till all the worker crash. So I couldn't try this out. I think that all the deep estimators are affected (see
I don't know the feasibility of doing this here's a possible solution - We create a fixture which runs only when certain soft-deps are included in the built. Otherwise, the tests in this fixture are just skipped. For now, we might have to skip all those tests regardless.
Yes. We are deleting the most essential part of the object - the |
So if I were to re-iterate, our revised remaining items will be:
If we agree, then I can move forward and make these changes in the next commit. |
Not sure what you mean by this, can you explain?
I thought with That would mean that
Yes |
As you said that user's test suite may not have soft dependencies installed. So we can skip those tests. Specifically, any test that uses
The updated |
I'd rather only skip the parts that require
I don't get what you mean, would appreciate an explanation. |
@fkiraly I hope things are clear after our last conversation. LMK if something still needs to be cleared. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, great contribution!
I know the PR is closed and merged, but I have some opinions about the implementation. Firstly, great contribution, thanks for this @achieveordie . I was testing this in my local machine and found out that it doesn't work as expected in the composite models. For example, if we apply I think we need some propagation mechanism for this to work. I mean, if a forecaster has an |
@ilkersigirci thanks for the much-needed input. I think we were expecting to find some bugs since this has just been implemented. What I would suggest is that we should tackle this on a new issue. So maybe open a new issue (possibly with reproducible code) and let's take on it from there. Thanks! |
Ofc, I will open a bug issue. BTW, in the |
Reference Issues/PRs
Fixes #3022
What does this implement/fix? Explain your changes.
The Save/Load function for classical estimators is from #3336 and has been expanded for DL models along the same design. Keras is known to be problematic for certain kinds of (de)serialization, so this uses
model.save()
fromkeras
along withpickle.dump()
. The estimators are saved as.zip
files.Components of this directory include:
The first two components are the same as previously mentioned PR, stored differently to avoid this problem.
All the attributes are saved and restored (optimizers/history etc.) appropriately.
Does your contribution introduce a new dependency? If yes, which one?
No
What should a reviewer concentrate their feedback on?
One thing in
__getstate__()
and__setstate__()
might be a source of contention, I've explained why it was necessary and I believe it is the simplest solution.Any other comments?
PR checklist
For all contributions