[DSD] Fix set_optimizer_state_dict() changes the parameters with some optimizers #125708

fegin · 2024-05-07T19:34:29Z

Stack from ghstack (oldest at bottom):

Summary:
Some optimizers, such as AdamW, change the parameters even when the gradients are zero, so set_optimizer_state_dict() may alter the parameter values when used with these optimizers. This PR fixes the issue.

This PR also fixes #121186.
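
To illustrate why a zero gradient does not imply an unchanged parameter, here is a minimal scalar sketch of the textbook AdamW update with decoupled weight decay. It is not PyTorch's implementation; `adamw_step` and its argument names are hypothetical and exist only for this example. With a zero gradient the Adam term vanishes, but the weight decay factor still shrinks the parameter.

```python
import math


def adamw_step(p, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    """One scalar AdamW update (illustrative only); returns the new (p, m, v)."""
    beta1, beta2 = betas
    # Decoupled weight decay: applied to the parameter regardless of the gradient.
    p = p * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction for step t (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adam term: exactly zero when g, m, and v are all zero.
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


p0 = 1.0
p1, m1, v1 = adamw_step(p0, g=0.0, m=0.0, v=0.0, t=1, lr=0.1, weight_decay=0.01)
print(p1 != p0)  # prints True: the parameter moved although the gradient was zero
```

In this sketch the parameter shrinks by the factor `1 - lr * weight_decay` on every step, so any code path that steps such an optimizer, even with all-zero gradients, will perturb the model weights.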

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC

[ghstack-poisoned]

pytorch-bot · 2024-05-07T19:34:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125708

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 3d827e5 with merge base 196a0b1:

NEW FAILURE - The following job has failed:

trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral) (gh)
RuntimeError: inductor/test_cutlass_backend 1/1 failed

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

wz337

LGTM!

fegin · 2024-05-08T06:55:24Z

@pytorchbot merge -f "The failing tests are not related."

pytorchmergebot · 2024-05-08T06:57:09Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…5338) Summary: This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict. Pull Request resolved: #125338 Approved by: https://github.com/weifengpy ghstack dependencies: #125708

…25339) Summary: This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict. Pull Request resolved: #125339 Approved by: https://github.com/weifengpy ghstack dependencies: #125708, #125338

Craigacp · 2024-06-18T21:20:45Z

Is there going to be a PyTorch 2.3.2, and if so, would it be possible to get this fix into it? I've spent all day running down slight parameter differences in my model when loading checkpoints, as this is called in get_optimizer_state_dict, which is necessary to get the optimizer dictionary to load into with dist_cp.load. When I loaded the optimizer checkpoint, it changed the freshly loaded model checkpoint, because it stepped an empty optimizer and so applied weight decay to all my parameters.

Update

dade0e0

[ghstack-poisoned]

fegin mentioned this pull request May 7, 2024

[DSD] Implement broadcast_from_rank0 option for model state_dict #125338

Closed

pytorch-bot bot added the module: distributed_checkpoint and oncall: distributed labels May 7, 2024

fegin mentioned this pull request May 7, 2024

[DSD] Implement broadcast_from_rank0 option for optim state_dict #125339

Closed

fegin requested a review from wz337 May 7, 2024 19:38

Update

3d827e5

[ghstack-poisoned]

wz337 approved these changes May 7, 2024

fegin added the ciflow/trunk label May 7, 2024

pytorchmergebot added the merging label May 8, 2024

pytorchmergebot added the Merged label May 8, 2024

pytorchmergebot closed this in 88fbe79 May 8, 2024

pytorchmergebot removed the merging label May 8, 2024

github-actions bot deleted the gh/fegin/239/head branch June 8, 2024 01:54
