Adjust auto microbatch size when hitting cuda alloc retries #1998
Conversation
Looking for help here... don't understand test failure
…nto mvpatel2000/retry
@@ -2142,6 +2158,10 @@ def _train_batch(self, use_grad_scaling: bool) -> Dict[str, torch.Tensor]:
        xm.optimizer_step(optimizer, barrier=True)
    else:
        optimizer.step()
    # Raise error if automicrobatching and num_alloc_retries increased, as thrashing
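For context, the allocator's retry counter is exposed through `torch.cuda.memory_stats()`. A minimal sketch of how a check along these lines could look (the helper and variable names are illustrative, not the PR's actual code):

```python
import torch

def _num_alloc_retries(device=None) -> int:
    # "num_alloc_retries" counts cudaMalloc calls that failed, forced a cache
    # flush, and were retried -- a strong signal of memory thrashing.
    return torch.cuda.memory_stats(device).get('num_alloc_retries', 0)

# Illustrative usage around an optimizer step:
retries_before = _num_alloc_retries()
optimizer.step()
if _num_alloc_retries() > retries_before:
    # Treat any new retry as a cue to shrink the microbatch size
    # (raised here and caught by the automicrobatching logic).
    raise RuntimeError('CUDA alloc retries increased; reduce the microbatch size.')
```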
Is error correct here? I'm just wondering how sure we are that any retries is always an error.
If we're not sure, we could add a trainer arg like `num_cuda_alloc_retries_allowed` (or whatever) and default it to zero. That way a user can at least escape this error if they want to.
+1 to this. I think we want some loose heuristic, or at least something softer than raising an error on any alloc retry. I've seen instances where an alloc retry happens just once during training and it's fine otherwise (so there's borderline no throughput impact).
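If the escape hatch suggested above were added, a softer check could look something like this (the `num_cuda_alloc_retries_allowed` name is taken from the comment above and is hypothetical, not part of the PR; the helpers are from the earlier sketch):

```python
# Hypothetical threshold; num_cuda_alloc_retries_allowed would be a Trainer arg
# defaulting to 0, preserving the current always-raise behavior by default.
num_cuda_alloc_retries_allowed = 0

new_retries = _num_alloc_retries() - retries_before
if new_retries > num_cuda_alloc_retries_allowed:
    raise RuntimeError(
        f'{new_retries} CUDA alloc retries this step exceeds the allowed '
        f'{num_cuda_alloc_retries_allowed}; reduce the microbatch size.')
```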
@@ -257,6 +268,8 @@ def _adjust_grad_accum(state: State, device_batch_size: int):
    del state.loss
    for optimizer in state.optimizers:
        optimizer.zero_grad(set_to_none=True)
    if state.scaler is not None:
What does this do, and why is it related to this PR?
Closing because of the graph Brandon showed -- this doesn't always work
What does this PR do?
If a run repeatedly hits CUDA alloc retries, throughput collapses due to memory thrashing. I finally found a reliable repro of this, which let me update auto grad accum to catch this case and lower the microbatch size to fix it.
Along for the ride, this also correctly resets the grad scaler, which was missing from the original implementation.
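A minimal sketch of what resetting the scaler during the grad accum adjustment could look like (assumes the standard `torch.cuda.amp.GradScaler`; not necessarily the PR's exact code):

```python
from torch.cuda.amp import GradScaler

# If an OOM was caught partway through a step, the scaler may carry state from the
# aborted step, so swap in a fresh scaler before retrying with a smaller microbatch.
if state.scaler is not None:
    state.scaler = GradScaler()
```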
[Figure: throughput comparison; blue is before, green is after]
What issue(s) does this change relate to?
CO-1827