[DSD] Improve the performance of distributed state_dict #125501
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125501
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit ceb9a04 with merge base 746da87. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM
Approved with question
```diff
@@ -132,6 +131,7 @@ class _StateDictInfo(StateDictOptions):
     fsdp_modules: List[nn.Module] = field(default_factory=list)
 
 
+@functools.lru_cache(maxsize=None)
```
Would it be safer to set a maxsize here? Otherwise this could lead to a (probably very small) memory leak?
When `maxsize` is None, performance is better because no LRU bookkeeping is used. `functools.cache` achieves the same behavior, but it is only available from Python 3.9. Theoretically, `get_state_dict` should receive the same module every time, so the cache of `_get_fqns` should stabilize after the first `get_state_dict` call.
I want to flag that this is not the case if `get_state_dict` is being used for different models in the same endpoint, and the cache will likely be leaky.
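To make the concern concrete, here is a minimal sketch of how an unbounded `functools.lru_cache` keyed on module objects behaves. The helper `_cached_fqn_lookup` and its body are made up for illustration and are not the actual `_get_fqns` implementation: with one long-lived model the cache stabilizes after the first pass, but every distinct model object adds entries that also keep that model alive.

```python
import functools

import torch.nn as nn


# Illustrative stand-in for the cached helper under discussion; the real
# _get_fqns signature and logic may differ.
@functools.lru_cache(maxsize=None)  # unbounded: entries are never evicted
def _cached_fqn_lookup(model: nn.Module, name: str) -> str:
    # Pretend this is an expensive FQN resolution.
    return name.replace("_fsdp_wrapped_module.", "")


# One long-lived model: the cache stabilizes after the first call,
# matching the argument in the reply above.
model = nn.Linear(4, 4)
for _ in range(3):
    _cached_fqn_lookup(model, "weight")
print(_cached_fqn_lookup.cache_info())  # hits grow, currsize stays small

# Many distinct models (e.g., different models hitting the same endpoint):
# each new module object adds cache entries, and the cache keys keep the
# modules themselves alive -- the leak flagged by the reviewer.
for _ in range(100):
    _cached_fqn_lookup(nn.Linear(4, 4), "weight")
print(_cached_fqn_lookup.cache_info())  # currsize keeps growing
```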
@pytorchbot merge -f "The failing test is not related."
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: If an object only exists on certain non-coordinator ranks, we still need to save it; otherwise, we lose these objects. If objects are duplicated, DCP will deduplicate them.
Pull Request resolved: #125334
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333, #125501
Summary: Right now DCP only flattens a mapping (e.g., a dict) if that mapping contains tensor objects. This behavior is odd, as users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict. We may also want to do this for a list/tuple, but that would cause extra pickles, so we don't do it for now.
Pull Request resolved: #125335
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #125333, #125501, #125334
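As a rough sketch of the scenario these two commits target, the snippet below saves a rank-local state_dict that mixes tensors with a purely non-tensor mapping (a dataloader-style state). It assumes the public `torch.distributed.checkpoint.save(..., checkpoint_id=...)` entry point from recent PyTorch releases; the path, field names, and single-process gloo setup are illustrative only.

```python
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

# Minimal single-process setup so the sketch is runnable; a real job would
# launch one process per rank (e.g., via torchrun).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
rank = dist.get_rank()

# A rank-local state_dict mixing tensors with a tensor-free mapping, such as
# dataloader progress. Per the commits above, a mapping with no tensors should
# still be flattened and saved, and objects that exist only on non-coordinator
# ranks should be kept (DCP deduplicates them if they are duplicated).
state_dict = {
    "model": {"weight": torch.randn(4, 4)},
    "dataloader": {"epoch": 3, "batches_seen": 1250, "rank": rank},
}

dcp.save(state_dict, checkpoint_id="/tmp/dcp_demo")
dist.destroy_process_group()
```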
Summary: distributed_state_dict should not try to use `getattr` to get `_extra_state`, as this is not well-defined.
Pull Request resolved: pytorch#125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335
[DSD] Correctly handle _extra_state (#125336)
Summary: distributed_state_dict should not try to use `getattr` to get `_extra_state`, as this is not well-defined.
Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335
Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
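A small example of why `getattr(module, "_extra_state")` is not well-defined: `_extra_state` is a state_dict key produced by `nn.Module.get_extra_state()` and consumed by `set_extra_state()`, not an attribute on the module. The class below is made up for illustration.

```python
import torch.nn as nn


class ModuleWithExtraState(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        self._version_tag = "v1"

    # nn.Module hooks for extra, non-parameter state. The returned value is
    # stored in the state_dict under the "<prefix>_extra_state" key.
    def get_extra_state(self):
        return {"version_tag": self._version_tag}

    def set_extra_state(self, state):
        self._version_tag = state["version_tag"]


m = ModuleWithExtraState()
print("_extra_state" in m.state_dict())  # True: it is a state_dict key
print(hasattr(m, "_extra_state"))        # False: getattr on it is not well-defined
```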
[DSD] Fix to remove non_persistent buffer in distributed state dict (pytorch#125337)
Summary: Fixes pytorch#122792. state_dict includes only persistent buffers, while named_buffers() would include non-persistent buffers.
Pull Request resolved: pytorch#125337
Approved by: https://github.com/awgu
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335, pytorch#125336
[DSD] Fix to remove non_persistent buffer in distributed state dict (#125337) (#127219)
Summary: Fixes #122792. state_dict includes only persistent buffers, while named_buffers() would include non-persistent buffers.
Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336
Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
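A brief illustration of the persistent vs. non-persistent buffer distinction this fix relies on (the module and buffer names are made up): `named_buffers()` reports both buffers, while `state_dict()` contains only the persistent one.

```python
import torch
import torch.nn as nn


class BufferDemo(nn.Module):
    def __init__(self):
        super().__init__()
        # Persistent buffer: included in state_dict().
        self.register_buffer("running_mean", torch.zeros(4))
        # Non-persistent buffer: visible via named_buffers() but excluded
        # from state_dict(), which is the behavior the fix aligns with.
        self.register_buffer("scratch", torch.zeros(4), persistent=False)


m = BufferDemo()
print([name for name, _ in m.named_buffers()])  # ['running_mean', 'scratch']
print(list(m.state_dict().keys()))              # ['running_mean']
```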
Stack from ghstack (oldest at bottom):
Summary:
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC