Conversation

@fegin fegin commented May 1, 2024

[ghstack-poisoned]

pytorch-bot bot commented May 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125339

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit f3bb51d with merge base 196a0b1:

NEW FAILURE - The following job has failed: periodic / win-vs2019-cuda11.8-py3 / test (default, 4, 4, windows.g5.4xlarge.nvidia.gpu)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: distributed_checkpoint and oncall: distributed labels May 1, 2024
@fegin fegin added the ciflow/periodic and ciflow/trunk labels May 1, 2024
@fegin fegin requested review from wz337 and LucasLLC May 1, 2024 21:31
[ghstack-poisoned]
            )
            if equal:
                self.assertEqual(states, fsdp_states)

        def check(equal):
Contributor
Do we need to check get_optimizer_state_dict as well? In torchtune, we call the model and optimizer state_dicts separately.

Contributor Author
@fegin fegin May 2, 2024

oh yes, somehow that was removed during rebasing. Sorry for the confusion, will add it back.
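
A minimal sketch of the kind of check being discussed, covering both the model and the optimizer state_dicts the way torchtune fetches them separately. The check helper and the fsdp_model/fsdp_optim names are hypothetical; only get_model_state_dict and get_optimizer_state_dict are real torch.distributed.checkpoint.state_dict APIs.

    from torch.distributed.checkpoint.state_dict import (
        get_model_state_dict,
        get_optimizer_state_dict,
    )

    def check(self, model, optim, fsdp_model, fsdp_optim, equal):
        # Fetch the model and optimizer state_dicts separately, as torchtune does.
        states = get_model_state_dict(model)
        fsdp_states = get_model_state_dict(fsdp_model)
        optim_states = get_optimizer_state_dict(model, optim)
        fsdp_optim_states = get_optimizer_state_dict(fsdp_model, fsdp_optim)
        if equal:
            self.assertEqual(states, fsdp_states)
            self.assertEqual(optim_states, fsdp_optim_states)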

[ghstack-poisoned]
fegin added a commit that referenced this pull request May 2, 2024
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.

ghstack-source-id: d6131de
Pull Request resolved: #125339
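
For context, a rough sketch of how the option added here is meant to be used, assuming an initialized process group. Field and function names follow torch.distributed.checkpoint.state_dict, though exact signatures may differ across PyTorch versions.

    import torch
    import torch.distributed as dist
    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions,
        set_model_state_dict,
        set_optimizer_state_dict,
    )

    def load_full_checkpoint(model, optim, ckpt_path):
        # Only rank0 materializes the full state_dict, so only one host pays
        # the CPU memory cost; the other ranks receive tensors via broadcast.
        if dist.get_rank() == 0:
            full_sd = torch.load(ckpt_path, map_location="cpu")
        else:
            full_sd = {"model": {}, "optim": {}}

        options = StateDictOptions(full_state_dict=True, broadcast_from_rank0=True)
        set_model_state_dict(model, full_sd["model"], options=options)
        set_optimizer_state_dict(
            model, optim, optim_state_dict=full_sd["optim"], options=options
        )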
@@ -683,11 +693,33 @@ def _load_optim_state_dict(
            optim_state_dict = FSDP.optim_state_dict_to_load(
                model, optim, optim_state_dict
            )
        elif info.broadcast_from_rank0:
            info.full_state_dict = False
            local_state_dict = _get_optim_state_dict(model, (optim,), info)
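
The new elif branch takes the broadcast path. A hypothetical helper (not the PR's exact implementation) illustrating the mechanism: rank0 holds the full tensor, every other rank allocates an empty buffer of the full shape, and a collective broadcast fills it, so non-rank0 hosts never hold the whole state_dict in CPU memory at once.

    import torch
    import torch.distributed as dist

    def broadcast_value_from_rank0(full_tensor, shape, dtype, device):
        # full_tensor is only materialized on rank0; other ranks pass None.
        if dist.get_rank() == 0:
            buf = full_tensor.to(device)
        else:
            buf = torch.empty(shape, dtype=dtype, device=device)
        dist.broadcast(buf, src=0)
        return buf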
Contributor

Are we using _get_optim_state_dict instead of torch.optim.Optimizer.state_dict because we want to align FQNs between local_state_dict and optim_state_dict? torch.optim.Optimizer.state_dict only gives us ID keys.

Contributor Author

Yes, it is easier to proceed with FQN keys.
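
To make the ID-keys point concrete, a small standalone illustration (not the PR's code): torch.optim.Optimizer.state_dict() keys parameter state by integer IDs, whereas _get_optim_state_dict / get_optimizer_state_dict key it by FQNs, which is what lets local_state_dict and the incoming optim_state_dict be matched entry by entry.

    import torch

    model = torch.nn.Linear(4, 4)
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    model(torch.randn(2, 4)).sum().backward()
    optim.step()  # populate the optimizer state

    # Plain Optimizer.state_dict() uses integer parameter IDs:
    print(list(optim.state_dict()["state"].keys()))  # [0, 1]
    # The distributed state_dict APIs instead use FQNs such as "weight" and
    # "bias", so entries can be matched by name across ranks and wrappers.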

[ghstack-poisoned]
fegin added a commit that referenced this pull request May 3, 2024
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.

ghstack-source-id: 467a730
Pull Request resolved: #125339
@@ -653,7 +659,11 @@ def _load_optim_state_dict(
        return

    for optim in optimizers:
        optim_state_dict = _split_optim_state_dict(model, optim, state_dict, info)
        _init_optim_state(optim)
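
For context on _init_optim_state in the hunk above, a simplified sketch of what such a helper does (not the exact implementation): optimizers create per-param state lazily on the first step, so before a state_dict can be loaded into a fresh optimizer, one step with zero gradients is run to allocate the state containers. The thread below flags a side effect of this approach.

    import torch

    def init_optim_state(optim: torch.optim.Optimizer) -> None:
        if optim.state:
            return  # state already initialized
        for group in optim.param_groups:
            for param in group["params"]:
                if param.requires_grad:
                    param.grad = torch.zeros_like(param)
        optim.step()  # allocates per-param state (e.g. exp_avg for Adam)
        optim.zero_grad(set_to_none=True)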
Contributor

_init_optim_state seems to update model.parameters() for Adam even though we set grad=0 ?

repro: pytest test_distributed.py P1233005758

Contributor Author

Fixed with #125708
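
A standalone illustration of the pitfall raised above (not the actual fix in #125708): a zero-gradient dummy step can still mutate the parameters whenever the update rule reads the params themselves, e.g. Adam with weight_decay > 0, where the effective gradient becomes grad + weight_decay * param != 0.

    import torch

    param = torch.nn.Parameter(torch.ones(3))
    optim = torch.optim.Adam([param], lr=1e-1, weight_decay=1e-2)

    before = param.detach().clone()
    param.grad = torch.zeros_like(param)  # the "zero grad" dummy step
    optim.step()                          # state is initialized, but...
    print(torch.equal(param.detach(), before))  # ...False: params moved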

[ghstack-poisoned]
fegin added a commit that referenced this pull request May 7, 2024
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.

ghstack-source-id: 0192056
Pull Request resolved: #125339
@fegin fegin commented May 8, 2024

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: periodic / win-vs2019-cuda11.8-py3 / test (default, 4, 4, windows.g5.4xlarge.nvidia.gpu)


@fegin fegin commented May 8, 2024

@pytorchbot merge -f "The failing tests are not related."

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@github-actions github-actions bot deleted the gh/fegin/236/head branch June 8, 2024 01:54