-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[DSD] Correctly handle _extra_state #125336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125336
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit a4d47b1 with merge base 746da87 ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
curr_obj = getattr(curr_obj, curr_obj_name) | ||
if curr_obj_name == nn.modules.module._EXTRA_STATE_KEY_SUFFIX: | ||
if i != len(obj_names) - 1: | ||
raise RuntimeError("Expect `_extra_state` to be the last obj name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
raise RuntimeError("Expect `_extra_state` to be the last obj name") | |
raise ValueError("Expect `_extra_state` to be the last obj name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the API that receives an incorrect argument but the traverse logic is wrong. So ValueError doesn't seem to be more close to the meaning. AssertionError may be more close. I'll change in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@pytorchbot merge -f "The failing tests are not related." |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Summary: distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined. Pull Request resolved: pytorch#125336 Approved by: https://github.com/LucasLLC ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335
* [DSD] Correctly handle _extra_state (#125336) Summary: distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined. Pull Request resolved: #125336 Approved by: https://github.com/LucasLLC ghstack dependencies: #125333, #125501, #125334, #125335 * lint * lint --------- Co-authored-by: Chien-Chin Huang <chienchin@fb.com> Co-authored-by: Andrey Talman <atalman@fb.com>
…ytorch#125337) Summary: Fixes pytorch#122792 state_dict includes only persistent buffers, while named_buffers() would include non_persistent buffers. Pull Request resolved: pytorch#125337 Approved by: https://github.com/awgu ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335, pytorch#125336
…125337) (#127219) * [DSD] Fix to remove non_persistent buffer in distributed state dict (#125337) Summary: Fixes #122792 state_dict includes only persistent buffers, while named_buffers() would include non_persistent buffers. Pull Request resolved: #125337 Approved by: https://github.com/awgu ghstack dependencies: #125333, #125501, #125334, #125335, #125336 * lintrunner * lint --------- Co-authored-by: Chien-Chin Huang <chienchin@fb.com> Co-authored-by: Andrey Talman <atalman@fb.com>
Hey @fegin! For this PR, we noticed there was a regression for loading state dict modules of models with tied weight embeddings. Here is the link to the broken test. Error traceback:
We believe adding this fix in our monkey patch of the _verify_option function fixes things.
Curious if you can help us upstream this composer bug fix into pytorch as well? |
Stack from ghstack (oldest at bottom):
Summary:
distributed_state_dict should not try to use
getattr
to get_extra_state
as this is not well-defined.cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC