antoinebrl (Contributor) commented May 27, 2024

Cherry-pick for the 2.3.1 release. ⚠️ #126567 must be merged before this PR.

[DSD] Fix to remove non_persistent buffer in distributed state dict (#125337)

Summary:
Fixes #122792

state_dict() includes only persistent buffers, whereas named_buffers()
also returns non-persistent buffers.
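
The mismatch described above can be reproduced with a small module. This is only a minimal sketch, not the code changed in this PR: `TinyModel`, `running_scale`, and `scratch` are hypothetical names made up for illustration, and filtering by `state_dict()` keys is just one way to drop non-persistent buffers using public `nn.Module` APIs.

```python
import torch
from torch import nn


class TinyModel(nn.Module):  # hypothetical module, for illustration only
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(4, 2)
        # Persistent buffer (the default): saved in state_dict().
        self.register_buffer("running_scale", torch.ones(2))
        # Non-persistent buffer: excluded from state_dict().
        self.register_buffer("scratch", torch.zeros(2), persistent=False)


model = TinyModel()

# named_buffers() yields every registered buffer, persistent or not.
buffer_names = {name for name, _ in model.named_buffers()}
# state_dict() contains parameters plus persistent buffers only.
state_keys = set(model.state_dict().keys())

# Buffers that named_buffers() reports but state_dict() does not contain.
non_persistent = buffer_names - state_keys

# Keep only the buffers that a checkpoint should contain.
persistent_buffers = {
    name: buf for name, buf in model.named_buffers() if name in state_keys
}
```

Any code that builds a checkpoint from `named_buffers()` alone would pick up `scratch` here, which is the kind of discrepancy the summary refers to.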

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC @atalman

pytorch-bot bot commented May 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127219

Note: Links to docs will display an error until the docs builds have been completed.

❌ 51 New Failures, 2 Unrelated Failures

As of commit 4084339 with merge base 86a2d67 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "module: distributed_checkpoint" and "oncall: distributed" labels May 27, 2024
antoinebrl force-pushed the ab/ckpt-non-persistent-buffer branch from d10a6ec to e6ccb5b May 27, 2024 16:19
huydhn merged commit bd1040c into pytorch:release/2.3 May 27, 2024

PaliC (Contributor) commented May 31, 2024

Validated that this works on CPU on the 2.3 release branch.

Labels: oncall: distributed, open source