OOM with a lot of GPU memory left #67680

@jakwisn

Description

πŸ› Bug

When training models with transformers, PyTorch reports that my GPU is out of memory even though plenty of memory should still be at its disposal. I have been trying to tackle this problem for some time now; I have tried switching OS, lowering batch sizes, etc. Every time (both on my personal machine and on a cluster) it gives me an error like this:

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 6.00 GiB total capacity; 4.26 GiB already allocated; 0 bytes free; 4.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It always happens at the first or second training step.

There have been similar problems reported, but I did not find any solution for this one. I also described my problem here, but now I think it is more of a PyTorch problem.
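
The error message suggests trying max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. For completeness, this is how I understand that setting would be applied (just a sketch; the 128 MiB value is an arbitrary example on my part, not a recommendation):

import os

# The allocator config is read when the CUDA caching allocator initializes,
# so set it before importing torch / before the first tensor reaches the GPU.
# 128 is only an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch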

To Reproduce

Steps to reproduce the behavior:

  1. I followed https://huggingface.co/transformers/training.html tutorial
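
For reference, this is roughly the code I am running, condensed from the tutorial (a sketch from memory: the exact dataset, checkpoint and batch size may differ slightly, but the training loop matches the traceback below):

import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda")

# Data preparation as in the tutorial (dataset and model names here are illustrative).
raw = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized = raw.map(tokenize, batched=True)
tokenized = tokenized.remove_columns(["text"]).rename_column("label", "labels")
tokenized.set_format("torch")
train_dataloader = DataLoader(tokenized["train"].shuffle(seed=42).select(range(1000)), batch_size=8)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # <-- this is where the OOM is raised
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()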

Expected behavior

I would expect the tutorial to work, or at least for there to be a logical explanation of what is consuming the memory.

Environment

PyTorch version: 1.10.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.11 (default, Nov  2 2021, 10:56:09)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1660 Ti
Nvidia driver version: 510.06
cuDNN version: Probably one of the following:
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] torch==1.10.0
[pip3] torchaudio==0.8.2
[pip3] torchvision==0.10.1
[conda] Could not collect

Additional context

Putting the full traceback here in case it helps:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_1721/1645165363.py in <module>
      7     for batch in train_dataloader:
      8         batch = {k: v.to(device) for k, v in batch.items()}
----> 9         outputs = model(**batch)
     10         loss = outputs.loss
     11         loss.backward()

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
   1500         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   1501 
-> 1502         outputs = self.bert(
   1503             input_ids,
   1504             attention_mask=attention_mask,

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    969             past_key_values_length=past_key_values_length,
    970         )
--> 971         encoder_outputs = self.encoder(
    972             embedding_output,
    973             attention_mask=extended_attention_mask,

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    566                 )
    567             else:
--> 568                 layer_outputs = layer_module(
    569                     hidden_states,
    570                     attention_mask,

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
    454         # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
    455         self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
--> 456         self_attention_outputs = self.attention(
    457             hidden_states,
    458             attention_mask,

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
    385         output_attentions=False,
    386     ):
--> 387         self_outputs = self.self(
    388             hidden_states,
    389             attention_mask,

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
    317         # This is actually dropping out entire tokens to attend to, which might
    318         # seem a bit unusual, but is taken from the original Transformer paper.
--> 319         attention_probs = self.dropout(attention_probs)
    320 
    321         # Mask heads if we want to

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/torch/nn/modules/dropout.py in forward(self, input)
     56 
     57     def forward(self, input: Tensor) -> Tensor:
---> 58         return F.dropout(input, self.p, self.training, self.inplace)
     59 
     60 

~/.cache/pypoetry/virtualenvs/mars-48yr609M-py3.8/lib/python3.8/site-packages/torch/nn/functional.py in dropout(input, p, training, inplace)
   1167     if p < 0.0 or p > 1.0:
   1168         raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p))
-> 1169     return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
   1170 
   1171 

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 6.00 GiB total capacity; 4.26 GiB already allocated; 0 bytes free; 4.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
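
If it is useful for triage, I can also dump the allocator state right before the failing step; something like the following (a sketch) is what I would run:

import torch

# How much memory live tensors occupy vs. how much the caching allocator holds.
print(f"{torch.cuda.memory_allocated(0) / 2**20:.1f} MiB allocated")
print(f"{torch.cuda.memory_reserved(0) / 2**20:.1f} MiB reserved by the caching allocator")

# Detailed per-pool breakdown (active/inactive blocks, fragmentation, etc.).
print(torch.cuda.memory_summary(device=0))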

cc @ngimel


    Labels

    module: cuda, module: memory usage, triaged
