2015aroras
Collaborator

@2015aroras 2015aroras commented Sep 12, 2024

Issue: In some runs, we have observed that GPU 0 has higher memory consumption than the other GPUs.

Fix: As proposed by @epwalsh, set the CUDA device before initializing the process group, which appears to help. I have not confirmed that memory usage becomes equal across GPUs (wandb issues prevented me from checking), but I have seen reduced memory consumption on GPU 0. There shouldn't be any harm in making this change, even though the fix is not 100% confirmed.
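
For reference, here is a minimal sketch of the ordering described above. It is not the actual diff from this PR. It assumes a `torchrun`-style launch that exports `LOCAL_RANK`, and the helper names `local_rank_from_env` and `init_distributed` are hypothetical. One plausible explanation, which this PR does not confirm, is that a process that touches CUDA before selecting its device allocates on the default device, GPU 0.

```python
import os


def local_rank_from_env(default: int = 0) -> int:
    """Return this process's per-node rank (hypothetical helper).

    Assumes the launcher (e.g. torchrun) exports LOCAL_RANK.
    """
    return int(os.environ.get("LOCAL_RANK", default))


def init_distributed(backend: str = "nccl") -> int:
    """Select this process's GPU first, then create the process group."""
    # Imported lazily so the helper above works without PyTorch installed.
    import torch
    import torch.distributed as dist

    local_rank = local_rank_from_env()

    # Select the device BEFORE creating the process group, so later CUDA
    # work in this process targets its own GPU rather than the default one.
    torch.cuda.set_device(local_rank)

    # With no init_method given, this uses env:// and reads RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT, which torchrun also exports.
    dist.init_process_group(backend=backend)
    return local_rank
```

Calling `init_distributed()` once at the top of the training entry point, before any tensors or models are moved to the GPU, keeps the ordering explicit.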

@2015aroras 2015aroras marked this pull request as ready for review September 12, 2024 18:12
@2015aroras 2015aroras requested a review from epwalsh September 12, 2024 18:13
@2015aroras 2015aroras merged commit d2b655a into main Sep 13, 2024
10 of 11 checks passed
@2015aroras 2015aroras deleted the shanea/set-device-early branch September 13, 2024 16:49
2 participants