[AutoMM] Bag of Tricks #4737
Conversation
@Innixma @tonyhoo @FANGAreNotGnu @suzhoum Please help review and approve. Please also help add the necessary labels to run multi-gpu jobs, since my developer permission was revoked. George suggested adding the bag of tricks into AutoMM. The corresponding paper describing the details will be published soon. This PR is large due to an unexpected backlog.
Job PR-4737-2e80a51 is done.
Great work! Just a curious question (not a blocker in any way): is it really necessary to change the parameter names ("optimization" -> "optim", "learning rate" -> "lr", etc.)? This makes the updated AutoMM library not backward compatible.
Great observation and question! To better implement the bag of tricks, we refactored several places in the data preprocessor and processors, which are not backward compatible. Therefore, we also took this opportunity to pay down tech debt and optimize some designs, including the config name changes. The optimization configs have multi-level hyperparameters, so the shorter names keep the full keys less verbose.
Got it. Great @zhiqiangdon!
Thanks @zhiqiangdon! Taking a look at this now, as I was at NeurIPS last week. Do you have a recommendation for how we should benchmark this PR? Will the existing autogluon-bench setup work for it? This PR also contains quite a few changes and thus will be challenging to review (>10k LoC changed across 188 files). Do you have a recommendation for how we should approach the review?
Given the changes / additions to the hyperparameters / params that a user may specify, it would be good if we could write a migration guide so users know how to modify existing scripts to be compatible with this version of the code. For example, mentioning to change `optimization.*` keys to `optim.*`.
    labels: np.ndarray,
):
    weighted_ensemble = EnsembleSelection(
        ensemble_size=self._ensemble_size * len(predictions),
Why `* len(predictions)`? This probably isn't necessary and would slow down the training a lot. I'd recommend just always using a smallish value such as `ensemble_size=40`.
`len(predictions)` is the number of models, not the number of samples. For example, if we train 10 models, then it would be 10. `ensemble_size` is a multiplier on this number, making the actual ensemble size adaptable to the number of models.
Yeah, I was aware of `len(predictions)` being the number of models, but I'll mention that in my testing with TabRepo there is virtually zero benefit to going beyond `ensemble_size=40`. In fact, going beyond this value is harmful due to validation overfitting on small datasets, and you are more likely to lose the sparseness property at very high ensemble_size, which will slow down your inference speed. Refer to fig 4 here: https://github.com/autogluon/tabrepo/blob/main/data/sensitivity.png
Similarly, it isn't necessary to use very small values. If I'm not mistaken, with 2 models the code currently only uses ensemble_size=4, which will be suboptimal, as 4 isn't enough to get good weights between the 2 models. I'd recommend always using a single value between 25 and 40 for simplicity. Tabular uses 40.
Great suggestion! We used `2 * model_num` in previous experiments. We can try 40 in follow-up PRs to see whether it improves performance.
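For context on what `ensemble_size` controls in this discussion, here is a rough, self-contained sketch of greedy weighted-ensemble selection (Caruana-style), not the actual AutoGluon implementation: `ensemble_size` is the number of greedy selection rounds, and the final weights are the selection counts divided by that number, which is why values much beyond ~40 barely change the weights for a small pool of models.

import numpy as np

def greedy_ensemble_selection(predictions, labels, ensemble_size=40):
    """Toy sketch of greedy ensemble selection with replacement.

    predictions: list of arrays, one per model, each of shape (n_samples,)
    labels: array of shape (n_samples,)
    Returns per-model weights that sum to 1.
    """
    counts = np.zeros(len(predictions))
    current = np.zeros_like(labels, dtype=float)

    for round_idx in range(ensemble_size):
        best_model, best_err = None, np.inf
        for m, pred in enumerate(predictions):
            # Error of the running average if model m were added this round.
            candidate = (current * round_idx + pred) / (round_idx + 1)
            err = np.mean((candidate - labels) ** 2)  # toy squared-error metric
            if err < best_err:
                best_model, best_err = m, err
        counts[best_model] += 1
        current = (current * round_idx + predictions[best_model]) / (round_idx + 1)

    return counts / ensemble_size

Because the weights are quantized to multiples of 1/ensemble_size, 25-40 rounds already give fine-grained weights for a handful of models, which matches the recommendation above.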
ensemble_mode
    The mode of conducting ensembling:
    - `one_shot`: the classic ensemble selection.
    - `sequential`: iteratively call the classic ensemble selection, each time growing the model zoo by the next best model.
`sequential` is a cool idea! It might make the most sense for this to be added in `core` so it isn't exclusive to AutoMM? That way we could simply call `EnsembleSelection(mode="sequential")` to do it in Tabular and TimeSeries as well.
Note: can be a follow-up PR
It's not exclusive to AutoMM. We can integrate it into EnsembleSelection later if needed.
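For context, a rough sketch of the `sequential` mode described above, under the assumption that each step re-runs classic ensemble selection on a growing pool and adds the single model that most improves the validation score. The function names and return values here are illustrative, not AutoMM's actual API.

def sequential_ensemble_selection(predictions, labels, fit_weighted_ensemble, score):
    """Illustrative sketch: grow the model pool one model at a time.

    fit_weighted_ensemble(preds, labels) -> (weights, val_score) is assumed to run
    classic ensemble selection on the given subset of model predictions.
    score(pred, labels): standalone validation score of one model (higher is better).
    """
    remaining = list(range(len(predictions)))
    # Seed the pool with the single best model.
    pool = [max(remaining, key=lambda m: score(predictions[m], labels))]
    remaining.remove(pool[0])
    best_weights, best_score = fit_weighted_ensemble([predictions[m] for m in pool], labels)

    while remaining:
        # Try adding each remaining model and keep the one that helps most.
        trials = [
            (m, *fit_weighted_ensemble([predictions[i] for i in pool + [m]], labels))
            for m in remaining
        ]
        m, weights, val_score = max(trials, key=lambda t: t[2])
        if val_score <= best_score:
            break  # no remaining model improves the ensemble
        pool.append(m)
        remaining.remove(m)
        best_weights, best_score = weights, val_score

    return pool, best_weights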
Good point. I have updated the PR description to add the migration guide.
@@ -4,4 +4,4 @@ AutoGluon_multimodal_best:
   params: # MultimodalPredictor.fit(params) # can add the actual job time limit here
     presets: best_quality
     hyperparameters:
-      optimization.max_epochs: 10 # 10 is default
+      optim.max_epochs: 10 # 10 is default
General question: minor bug fixes / updates have been made over the last several months in AutoMM mainline. Does this PR incorporate those changes as well (via rebasing during the period of development), or does it overwrite them?
If it was rebased on them, were there any major merge conflicts resolved as part of this PR worth mentioning?
This PR is based on the most recent AutoMM mainline. I didn't notice any major conflicts other than the configuration changes.
Good questions. Here's some context and recommendations:

Regarding benchmarking, the best_quality preset currently trains only a single model. To incorporate the bag of tricks, you'll need to set use_ensemble=True when initializing the predictor. At this point, autogluon-bench does not support benchmarking the bag of tricks due to the additional setup required for datasets and certain model checkpoints (e.g., meta-transformer). For now, I suggest referring to the results documented here.

This PR is large because of management issues earlier this year that led to an unexpected backlog. All unit tests and tutorial builds have passed successfully.

To proceed, you can benchmark the three existing presets (single model without the bag of tricks) and compare the results with previous releases. Many parts of this PR involve configuration changes and should be straightforward to skim through.
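For reference, a minimal sketch of enabling the ensemble learner as described above (the CSV path and label column name are illustrative):

import pandas as pd
from autogluon.multimodal import MultiModalPredictor

train_data = pd.read_csv("train.csv")  # illustrative multimodal table

# use_ensemble=True enables the bag-of-tricks ensemble learner.
predictor = MultiModalPredictor(label="label", use_ensemble=True)
predictor.fit(train_data)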
While referring to the paper results is good, they are still results from a snapshot in time that doesn't necessarily represent the state of the system in practice today. The current idea I have is that once a few more people are back from vacation in mid-January, we can run a benchmark to verify stability and performance with the presets. If it looks good and no other major concerns come up, we can look to merge. This could take some time, though, as the changes are significant.
Job PR-4737-23f5671 is done.
Hi @zhiqiangdon @Innixma, to address your comments, I have reprocessed the results and am attaching them here. I have reverted the negative signs for RMSE, and I have removed single-modality datasets from the mainline benchmark, so there are in total 10 multi-modality datasets (2 image-tabular, 1 image-text, 2 image-text-tabular, 5 text-tabular) for evaluation. There was, however, one benchmark missing from the mainline_high_quality run, but it should not affect the overall performance evaluation much.
@suzhoum can you compare all 6 in the same table? I'm curious if, for example, en_medium beats main_best.
In general the results look very good, nice work @zhiqiangdon! And thanks for running the benchmarks @suzhoum! I think we are nearly there for merging. The next steps are to minimize friction for existing users.

@zhiqiangdon The renaming of hyperparameters should include backwards compatibility, at least until the v1.4 release. Can you create a separate PR that points to this PR (so it is a new branch building on top of this PR's branch), which adds the backwards compatibility for hyperparameters? I mention a separate PR so it is easy to review in isolation from the other changes in this PR. This way we won't have a gap period between merging this PR and adding the backwards compatibility, where existing users who are running AutoGluon via source install would have their code broken due to the breaking change.

You can consider this PR to be mostly fixed; I don't expect to request many changes to it before merge.
Re backwards compatibility, one idea is to check the dictionary specified by the user at the start of fitting. I had more thoughts on this earlier here: #4737 (comment)
Please see attached.
Thanks @suzhoum! This seems to indicate that en_medium > main_best by a notable margin. If we can resolve respecting the time limit in a future PR while maintaining most of the benefit, this could be a compelling option for users.
@Innixma I have added the backward compatibility in this PR; opening another PR doesn't seem to save time. You can check the logic in the new commit.
Job PR-4737-d89efbe is done.
for old_k, new_k in key_pairs.items():
    if k == old_k:
        overrides[new_k] = overrides.pop(provided_k)
        logger.warning(
            f"The hyperparameter name {provided_k} is deprecated. "
            f"We recommend using the new name {new_k} instead."
        )
        break
nit: `if k in key_pairs:` is simpler and avoids the for loop.
Thanks! Adopted the suggestion.
@@ -812,3 +813,59 @@ def update_ensemble_hyperparameters(
        hyperparameters = presets_hyperparameters

    return hyperparameters


def make_overrides_backward_compatible(overrides: Dict):
Should we do an `overrides = copy.deepcopy(overrides)` at the start of this function to avoid in-place mutation of the user's dictionary?
A `copy.deepcopy` is applied before calling `apply_omegaconf_overrides`, so there is no need to copy again. See line 188 here: https://github.com/autogluon/autogluon/pull/4737/files#diff-a10c8783a894991a02709f19d3b1a2653731c04d798ec8b5234b0c9dc2d55bcdR188
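Putting the two review points together, a hedged sketch of what the helper could look like. The key names here are examples taken from this thread rather than the full mapping in the PR, and the caller is assumed to deepcopy the overrides as noted above.

import logging
from typing import Dict

logger = logging.getLogger(__name__)

# Example mapping only; the real table in the PR covers all renamed keys.
DEPRECATED_KEY_PAIRS = {
    "optimization.max_epochs": "optim.max_epochs",
    "optimization.learning_rate": "optim.lr",
}

def make_overrides_backward_compatible(overrides: Dict) -> Dict:
    """Map deprecated hyperparameter names to their new names.

    The caller deepcopies user-provided overrides before this point,
    so mutating `overrides` here does not touch the user's dict.
    """
    for provided_k in list(overrides):
        if provided_k in DEPRECATED_KEY_PAIRS:  # simpler than looping over the pairs
            new_k = DEPRECATED_KEY_PAIRS[provided_k]
            overrides[new_k] = overrides.pop(provided_k)
            logger.warning(
                f"The hyperparameter name {provided_k} is deprecated. "
                f"We recommend using the new name {new_k} instead. "
                f"The deprecated hyperparameter will raise an exception starting in AutoGluon 1.4.0."
            )
    return overrides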
logger.warning(
    f"The hyperparameter name {provided_k} is deprecated. "
    f"We recommend using the new name {new_k} instead."
)
Suggested change:

logger.warning(
    f"The hyperparameter name {provided_k} is deprecated. "
    f"We recommend using the new name {new_k} instead. "
    f"The deprecated hyperparameter will raise an exception starting in AutoGluon 1.4.0"
)
Updated.
@zhiqiangdon thanks! I added comments to the backward compatibility code.
Job PR-4737-5c75857 is done.
@zhiqiangdon can you resolve the minor merge conflict? Once resolved, I think we are good to merge.
hyperparameters={"AG_AUTOMM": {"env.num_workers": 0, "env.num_workers_evaluation": 0}}, | ||
time_limit=60, | ||
hyperparameters={"AG_AUTOMM": {"env.num_workers": 0, "env.num_workers_inference": 0}}, |
Please keep the `time_limit=60`, otherwise this takes around 10 minutes, which slows down the CI.
Updated
Job PR-4737-0787513 is done.
@Innixma The conflict was resolved, and tests have passed.
Looks good to me! Thanks @zhiqiangdon for the major improvements and cleanup, as well as the quick follow-ups!
Issue #, if available:
Description of changes:
This PR implements a bag of tricks for multimodal AutoML, covering multimodal model fusion strategies, multimodal data augmentation, cross-modal alignment, converting tabular data into text, handling missing modalities with modality dropout and learnable embeddings, and an ensemble learner that integrates the bag of tricks for optimal performance. To better implement these tricks, we also refactored AutoMM's codebase accordingly. For technical details, see this doc.
Here is one example of using the bag of tricks:
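A minimal sketch follows; the CSV paths, label column, and hyperparameter value are illustrative, while `use_ensemble=True` is the flag discussed above that enables the ensemble learner integrating the tricks.

import pandas as pd
from autogluon.multimodal import MultiModalPredictor

train_df = pd.read_csv("train.csv")  # illustrative multimodal table (text/image/tabular columns)
test_df = pd.read_csv("test.csv")

predictor = MultiModalPredictor(label="label", use_ensemble=True)
predictor.fit(
    train_df,
    hyperparameters={"optim.max_epochs": 10},  # new-style config name (formerly optimization.max_epochs)
)

print(predictor.evaluate(test_df))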
For users who used AutoMM previously, please change your configurations as below. No action is needed if you didn't customize these hyperparameters.
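For example, based on the renames discussed in this thread (not an exhaustive mapping; see the migration guide):

# Before (deprecated names):
hyperparameters = {
    "optimization.max_epochs": 10,
    "optimization.learning_rate": 1e-4,
}

# After (new names):
hyperparameters = {
    "optim.max_epochs": 10,
    "optim.lr": 1e-4,
}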
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.