
Conversation


pytorch-bot bot commented May 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125501

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit ceb9a04 with merge base 746da87:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

@wz337 wz337 left a comment


LGTM

Contributor

@LucasLLC LucasLLC left a comment


Approved with question

@@ -132,6 +131,7 @@ class _StateDictInfo(StateDictOptions):
     fsdp_modules: List[nn.Module] = field(default_factory=list)


+@functools.lru_cache(maxsize=None)
Contributor


Would it be safer to set a maxsize here? Otherwise this could lead to a (probably very small) memory leak.

Contributor Author


When maxsize is None, performance is faster because no LRU eviction logic is used. functools.cache achieves the same behavior, but it is only available from Python 3.9. In theory, get_state_dict should receive the same module every time, so the cache of _get_fqns should stabilize after the first get_state_dict call.
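
A minimal sketch of the trade-off being discussed; the function below is illustrative and not the actual `_get_fqns` signature:

```python
import functools

# maxsize=None disables LRU bookkeeping entirely: lookups are cheaper,
# but cached entries are never evicted for the life of the process.
@functools.lru_cache(maxsize=None)
def expensive_lookup(module_id: int, name: str) -> str:
    print(f"computing {module_id}/{name}")  # runs only on a cache miss
    return f"{module_id}.{name}"

expensive_lookup(1, "weight")  # miss: body executes
expensive_lookup(1, "weight")  # hit: cached result returned, body skipped

# On Python 3.9+, functools.cache is exactly lru_cache(maxsize=None):
#   @functools.cache
#   def expensive_lookup(...): ...
```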

Contributor


I want to flag that this is not the case if get_state_dict is used for different models in the same endpoint; in that scenario the cache will likely leak.
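
For reference, a standalone illustration of how an unbounded cache grows with distinct keys, and the knobs `lru_cache` exposes to bound or reset it (stdlib behavior only, not what this PR does):

```python
import functools

@functools.lru_cache(maxsize=None)
def cached_by_key(key: str) -> str:
    return key.upper()

# Each distinct key (e.g., FQNs from a different model) stays cached forever.
for i in range(3):
    cached_by_key(f"model_{i}.layer.weight")

print(cached_by_key.cache_info())  # CacheInfo(hits=0, misses=3, maxsize=None, currsize=3)

# Mitigations: pass a finite maxsize, or clear the cache explicitly.
cached_by_key.cache_clear()
print(cached_by_key.cache_info())  # currsize is back to 0
```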

@fegin
Contributor Author

fegin commented May 7, 2024

@pytorchbot merge -f "The failing test is not related."

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
If an object only exists on certain non-coordinator ranks, we still need to save it; otherwise we lose these objects. If they are duplicated, DCP will deduplicate them.

Pull Request resolved: #125334
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333, #125501
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
Right now DCP only flattens a mapping (e.g., a dict) if that mapping contains tensor objects. This behavior is odd because users may save different non-tensor objects on different ranks; without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict.

We may also want to do the same for lists/tuples, but that would cause extra pickling, so we don't do it for now.

Pull Request resolved: #125335
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #125333, #125501, #125334
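
A rough sketch of what flattening a nested mapping means here; this is illustrative Python, not DCP's actual flattening code:

```python
from typing import Any, Dict, Mapping

def flatten_mapping(obj: Mapping[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Recursively flatten nested mappings into dot-separated keys so that
    non-tensor leaves (e.g., dataloader counters) become individual entries."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, Mapping):
            flat.update(flatten_mapping(value, full_key))
        else:
            flat[full_key] = value
    return flat

# {"dataloader": {"epoch": 3, "batch": 17}} -> {"dataloader.epoch": 3, "dataloader.batch": 17}
print(flatten_mapping({"dataloader": {"epoch": 3, "batch": 17}}))
```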
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined.

Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335
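
A small standalone example of why `getattr` is the wrong tool for `_extra_state`: the key is synthesized by `state_dict()` from the `get_extra_state()` hook and is not an attribute on the module (plain `nn.Module` behavior, independent of DSD internals):

```python
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        self.version = 7  # metadata we want serialized alongside tensors

    # state_dict()/load_state_dict() call these hooks for extra state.
    def get_extra_state(self):
        return {"version": self.version}

    def set_extra_state(self, state):
        self.version = state["version"]

m = MyModule()
print("_extra_state" in m.state_dict())       # True: key produced by state_dict()
print(getattr(m, "_extra_state", "missing"))  # 'missing': no such attribute exists
```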
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
…125337)

Summary:
Fixes #122792

state_dict includes only persistent buffers, while named_buffers() would include non-persistent buffers.

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336
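
For context, a standalone example of the buffer distinction the commit message relies on (plain `nn.Module` behavior):

```python
import torch
import torch.nn as nn

class Counter(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("steps", torch.zeros(1))                      # persistent (default)
        self.register_buffer("scratch", torch.zeros(1), persistent=False)  # not saved in state_dict

m = Counter()
print(list(m.state_dict()))               # ['steps']            -- persistent buffers only
print([n for n, _ in m.named_buffers()])  # ['steps', 'scratch'] -- includes non-persistent
```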
mvpatel2000 pushed a commit to mvpatel2000/pytorch that referenced this pull request May 17, 2024
Summary:
distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined.

Pull Request resolved: pytorch#125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335
atalman added a commit that referenced this pull request May 27, 2024
* [DSD] Correctly handle _extra_state (#125336)

Summary:
distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined.

Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335

* lint

* lint

---------

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
antoinebrl pushed a commit to antoinebrl/pytorch that referenced this pull request May 27, 2024
…ytorch#125337)

Summary:
Fixes pytorch#122792

state_dict includes only persistent buffers, while named_buffers() would include non-persistent buffers.

Pull Request resolved: pytorch#125337
Approved by: https://github.com/awgu
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335, pytorch#125336
huydhn pushed a commit that referenced this pull request May 27, 2024
…125337) (#127219)

* [DSD] Fix to remove non_persistent buffer in distributed state dict (#125337)

Summary:
Fixes #122792

state_dict includes only persistent buffers, while named_buffers() would include non-persistent buffers.

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336

* lintrunner

* lint

---------

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
@github-actions github-actions bot deleted the gh/fegin/237/head branch June 7, 2024 01:54
Labels
ciflow/trunk, Merged, oncall: distributed