
Getting NaNs with t5-large + fix #10830

@yuvalkirstain

Description


Environment info

  • transformers version: 4.5.0.dev0
  • Platform: Linux-4.15.0-65-generic-x86_64-with-glibc2.10
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.7.1+cu101 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@patil-suraj @patrickvonplaten

Information

Model I am using (Bert, XLNet ...): t5-large

The problem arises when using:

  • my own modified scripts: run_seq2seq with minor modifications (attached)

The task I am working on is:

  • my own task or dataset: Closed-Book Open Domain QA

To reproduce

Steps to reproduce the behavior (the fix I'm suggesting is very simple, so perhaps there is no reason to reproduce):

  1. unzip the attached zip (below).
  2. run:

python run_seq2seq.py \
    --model_name_or_path=t5-large \
    --do_train \
    --do_eval \
    --task=qa \
    --train_file=data/PAQ.filtered.regular.16000.json \
    --validation_file=data/PAQ.filtered.regular.16000.json \
    --output_dir=results/5e-5-t5-large-4096000-128-140-1792000-0.1-regular-true-4 \
    --overwrite_output_dir \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=128 \
    --predict_with_generate \
    --fp16 \
    --max_steps=1000 \
    --evaluation_strategy=steps \
    --text_column=question \
    --summary_column=answer \
    --save_total_limit=5 \
    --cache_dir=../.cache \
    --save_steps=500000 \
    --learning_rate=5e-5 \
    --eval_steps=96000 \
    --warmup_steps=100 \
    --run_name=5e-5-t5-large-4096000-128-140-1792000-0.1-regular-true-4 \
    --dropout_rate=0.1 \
    --gradient_accumulation_steps=1 \
    --logging_steps=1

Expected behavior

Training without NaNs.

Possible fix

I debugged and saw that we get NaNs in modeling_t5.py at line 241:

hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)

Modifying this line to:

clamp_value = torch.finfo(hidden_states.dtype).max - 1000
hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) * torch.rsqrt(variance + self.variance_epsilon)

seems to solve the problem.

BTW, it happens in the last layers (which might explain why it wasn't caught in this fix).
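For anyone who wants to confirm which layer overflows first, here is a small forward-hook helper I used while debugging (a hypothetical aid, not part of transformers); it runs one forward pass and reports every module whose output contains inf/NaN:

```python
import torch

def find_nonfinite_modules(model, *inputs):
    """Run a forward pass and return the names of modules whose output
    contains inf or nan -- useful for locating the first layer that
    overflows under fp16."""
    offenders = []

    def make_hook(name):
        def hook(module, args, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                offenders.append(name)
        return hook

    # Register a hook on every submodule, keyed by its qualified name.
    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()]
    try:
        model(*inputs)
    finally:
        for h in handles:
            h.remove()
    return offenders
```

Calling it on the T5 model with a sample batch pointed straight at the final layer norms in my runs.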

seq2seq.zip
