Adjust auto microbatch size when hitting cuda alloc retries #1998

Closed
mvpatel2000 wants to merge 16 commits from the mvpatel2000/retry branch

Conversation

mvpatel2000 (Contributor) commented Feb 24, 2023

What does this PR do?

If a run repeatedly hits CUDA alloc retries, throughput collapses due to memory thrashing. I finally found a reliable repro of this, which let me update auto grad accum to catch this case and lower the microbatch size to fix it.

Along the way, this also correctly resets the grad scaler, which was missing in the original implementation.

[Throughput plot: blue is before, green is after]

What issue(s) does this change relate to?

CO-1827
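
For context, a minimal sketch of the kind of check this adds, assuming the retry counter is read from torch.cuda.memory_stats(); the class and helper names below are illustrative, not Composer's actual internals:

```python
# Illustrative sketch only (not Composer's actual internals): watch the CUDA caching
# allocator's retry counter and halve the device microbatch size when it climbs,
# since repeated retries indicate memory thrashing.
import torch


def _alloc_retries() -> int:
    """Cumulative count of caching-allocator retries on the current device."""
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.memory_stats().get('num_alloc_retries', 0)


class AutoMicrobatcher:
    """Hypothetical helper; the real logic lives inside the Trainer."""

    def __init__(self, device_microbatch_size: int):
        self.device_microbatch_size = device_microbatch_size
        self._last_retries = _alloc_retries()

    def check_after_step(self) -> None:
        """Call after optimizer.step(); shrink the microbatch if the allocator retried."""
        retries = _alloc_retries()
        if retries > self._last_retries and self.device_microbatch_size > 1:
            self.device_microbatch_size = max(1, self.device_microbatch_size // 2)
            torch.cuda.empty_cache()  # return cached blocks before retrying the batch
        self._last_retries = retries
```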

mvpatel2000 (Contributor, Author):

Looking for help here... I don't understand the test failure.

@@ -2142,6 +2158,10 @@ def _train_batch(self, use_grad_scaling: bool) -> Dict[str, torch.Tensor]:
xm.optimizer_step(optimizer, barrier=True)
else:
optimizer.step()
# Raise error if automicrobatching and num_alloc_retries increased, as thrashing
Contributor:

Is an error correct here? I'm just wondering how sure we are that any retry always indicates a problem.

Contributor:

If we're not sure, we could add a trainer arg num_cuda_alloc_retries_allowed or whatever, and default it to zero. That way a user can at least escape this error if they want.
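
A rough sketch of that escape hatch; num_cuda_alloc_retries_allowed and the surrounding wiring are hypothetical, not an existing Trainer argument:

```python
# Hypothetical escape hatch (num_cuda_alloc_retries_allowed is not a real Trainer arg):
# only raise once the retries accumulated since the last check exceed the allowance.
import torch


def check_alloc_retries(retries_before: int, num_cuda_alloc_retries_allowed: int = 0) -> None:
    retries_now = torch.cuda.memory_stats().get('num_alloc_retries', 0)
    new_retries = retries_now - retries_before
    if new_retries > num_cuda_alloc_retries_allowed:
        raise RuntimeError(
            f'CUDA allocator retried {new_retries} time(s), exceeding the '
            f'{num_cuda_alloc_retries_allowed} allowed; lower the microbatch size to avoid thrashing.')
```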

Contributor:

+1 to this. I think we might want some loose heuristic, or something softer than raising an error on any single alloc_retry. I've seen instances where an alloc_retry happens just once during training and it's fine otherwise (so there's borderline no throughput impact).

bcui19 (Contributor) commented Feb 24, 2023:

[screenshot of training run metrics]

e.g. this model hit alloc_retry once, and never hit it again, without any interventions on batch_size or parameters.

@@ -257,6 +268,8 @@ def _adjust_grad_accum(state: State, device_batch_size: int):
del state.loss
for optimizer in state.optimizers:
optimizer.zero_grad(set_to_none=True)
if state.scaler is not None:
Contributor:

What does this do, and why is it related to this PR?
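
Per the PR description, this resets the grad scaler, which the original implementation missed. A minimal sketch of that idea, based on the description rather than the exact diff:

```python
# Sketch of the intent described in the PR (not the exact diff): when a grad-accum
# adjustment abandons a partially-applied step, clear gradients and re-create the
# GradScaler so no stale scale/inf bookkeeping carries over into the retried step.
import torch
from torch.cuda.amp import GradScaler


def _reset_after_adjustment(state) -> None:
    if hasattr(state, 'loss'):
        del state.loss
    for optimizer in state.optimizers:
        optimizer.zero_grad(set_to_none=True)
    if state.scaler is not None:
        state.scaler = GradScaler()  # fresh scaler state for the retried microbatches
    torch.cuda.empty_cache()
```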

mvpatel2000 (Contributor, Author):

Closing because of the graph Brandon showed -- this doesn't always work

mvpatel2000 deleted the mvpatel2000/retry branch February 24, 2023 18:02
mvpatel2000 mentioned this pull request Feb 24, 2023