Adjust auto microbatch size when hitting cuda alloc retries #1998

Closed
mvpatel2000 wants to merge 16 commits from the mvpatel2000/retry branch

Conversation

mvpatel2000 (Contributor) commented Feb 24, 2023

What does this PR do?

If a run repeatedly hits CUDA alloc retries, throughput collapses due to memory thrashing. I finally found a reliable repro of this, which let me update auto grad accum to catch this case and lower the microbatch size to fix it.

Along the way, this also correctly resets the grad scaler, which was missing in the original implementation.

[Throughput plot: blue is before, green is after]

What issue(s) does this change relate to?

CO-1827
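
For context, a minimal sketch of the kind of check this adds, assuming the retry counter is read from torch.cuda.memory_stats(); the class and helper names below are illustrative, not Composer's actual internals:

```python
# Illustrative sketch only (not Composer's actual internals): watch the CUDA caching
# allocator's retry counter and halve the device microbatch size when it climbs,
# since repeated retries indicate memory thrashing.
import torch


def _alloc_retries() -> int:
    """Cumulative count of caching-allocator retries on the current device."""
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.memory_stats().get('num_alloc_retries', 0)


class AutoMicrobatcher:
    """Hypothetical helper; the real logic lives inside the Trainer."""

    def __init__(self, device_microbatch_size: int):
        self.device_microbatch_size = device_microbatch_size
        self._last_retries = _alloc_retries()

    def check_after_step(self) -> None:
        """Call after optimizer.step(); shrink the microbatch if the allocator retried."""
        retries = _alloc_retries()
        if retries > self._last_retries and self.device_microbatch_size > 1:
            self.device_microbatch_size = max(1, self.device_microbatch_size // 2)
            torch.cuda.empty_cache()  # return cached blocks before retrying the batch
        self._last_retries = retries
```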

mvpatel2000 (Contributor, Author):

Looking for help here... I don't understand the test failure.

@@ -2142,6 +2158,10 @@ def _train_batch(self, use_grad_scaling: bool) -> Dict[str, torch.Tensor]:
xm.optimizer_step(optimizer, barrier=True)
else:
optimizer.step()
# Raise error if automicrobatching and num_alloc_retries increased, as thrashing
Contributor:

Is an error correct here? I'm just wondering how sure we are that any retry always indicates a problem.

Contributor:

If we're not sure, we could add a trainer arg num_cuda_alloc_retries_allowed or whatever, and default it to zero. That way a user can at least escape this error if they want.
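
A rough sketch of that escape hatch; num_cuda_alloc_retries_allowed and the surrounding wiring are hypothetical, not an existing Trainer argument:

```python
# Hypothetical escape hatch (num_cuda_alloc_retries_allowed is not a real Trainer arg):
# only raise once the retries accumulated since the last check exceed the allowance.
import torch


def check_alloc_retries(retries_before: int, num_cuda_alloc_retries_allowed: int = 0) -> None:
    retries_now = torch.cuda.memory_stats().get('num_alloc_retries', 0)
    new_retries = retries_now - retries_before
    if new_retries > num_cuda_alloc_retries_allowed:
        raise RuntimeError(
            f'CUDA allocator retried {new_retries} time(s), exceeding the '
            f'{num_cuda_alloc_retries_allowed} allowed; lower the microbatch size to avoid thrashing.')
```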

Contributor:

+1 to this. I think we might want some loose heuristic, or something softer than raising an error on any single alloc_retry. I've seen instances where an alloc_retry happens just once during training and it's fine otherwise (so there's borderline no throughput impact).

bcui19 (Contributor) commented Feb 24, 2023:

[screenshot of training run metrics]

e.g. this model hit alloc_retry once, and never hit it again, without any interventions on batch_size or parameters.

@@ -257,6 +268,8 @@ def _adjust_grad_accum(state: State, device_batch_size: int):
del state.loss
for optimizer in state.optimizers:
optimizer.zero_grad(set_to_none=True)
if state.scaler is not None:
Contributor:

What does this do, and why is it related to this PR?
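
Per the PR description, this resets the grad scaler, which the original implementation missed. A minimal sketch of that idea, based on the description rather than the exact diff:

```python
# Sketch of the intent described in the PR (not the exact diff): when a grad-accum
# adjustment abandons a partially-applied step, clear gradients and re-create the
# GradScaler so no stale scale/inf bookkeeping carries over into the retried step.
import torch
from torch.cuda.amp import GradScaler


def _reset_after_adjustment(state) -> None:
    if hasattr(state, 'loss'):
        del state.loss
    for optimizer in state.optimizers:
        optimizer.zero_grad(set_to_none=True)
    if state.scaler is not None:
        state.scaler = GradScaler()  # fresh scaler state for the retried microbatches
    torch.cuda.empty_cache()
```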

mvpatel2000 (Contributor, Author):

Closing because of the graph Brandon showed -- this doesn't always work

mvpatel2000 deleted the mvpatel2000/retry branch February 24, 2023 18:02
mvpatel2000 mentioned this pull request Feb 24, 2023