Environment info
- `transformers` version: 4.5.0.dev0
- Platform: Linux-4.15.0-65-generic-x86_64-with-glibc2.10
- Python version: 3.8.8
- PyTorch version (GPU?): 1.7.1+cu101 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
@patil-suraj @patrickvonplaten
Information
Model I am using (Bert, XLNet ...): t5-large
The problem arises when using:
- my own modified scripts: run_seq2seq.py with minor modifications (attached)
The task I am working on is:
- my own task or dataset: Closed-Book Open Domain QA
To reproduce
Steps to reproduce the behavior (the fix I'm suggesting is very simple, so reproducing may not be necessary):
- unzip the attached zip (below).
- run
python run_seq2seq.py \
    --model_name_or_path=t5-large \
    --do_train \
    --do_eval \
    --task=qa \
    --train_file=data/PAQ.filtered.regular.16000.json \
    --validation_file=data/PAQ.filtered.regular.16000.json \
    --output_dir=results/5e-5-t5-large-4096000-128-140-1792000-0.1-regular-true-4 \
    --overwrite_output_dir \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=128 \
    --predict_with_generate \
    --fp16 \
    --max_steps=1000 \
    --evaluation_strategy=steps \
    --text_column=question \
    --summary_column=answer \
    --save_total_limit=5 \
    --cache_dir=../.cache \
    --save_steps=500000 \
    --learning_rate=5e-5 \
    --eval_steps=96000 \
    --warmup_steps=100 \
    --run_name=5e-5-t5-large-4096000-128-140-1792000-0.1-regular-true-4 \
    --dropout_rate=0.1 \
    --gradient_accumulation_steps=1 \
    --logging_steps=1
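For reference, given --text_column=question and --summary_column=answer above, each record in the training file should look roughly like this (field names are inferred from the flags; the concrete values here are made up, not taken from the actual PAQ file):

```python
import json

# Hypothetical record shape for data/PAQ.filtered.regular.16000.json,
# as implied by --text_column=question and --summary_column=answer.
record = {"question": "who invented the telephone", "answer": "Alexander Graham Bell"}
print(json.dumps(record))
```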
Expected behavior
Training without NaNs.
Possible fix
I debugged and saw that we get NaNs in modeling_t5.py, at line 241:
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
Modifying this line to:
clamp_value = torch.finfo(hidden_states.dtype).max - 1000
hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) * torch.rsqrt(variance + self.variance_epsilon)
seems to solve the problem.
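For context, here is a minimal sketch of how the clamp would slot into T5LayerNorm.forward (class and attribute names follow modeling_t5.py around version 4.5.0; treat this as an illustration of the proposed change, not an exact copy of the file, which among other things also casts back to fp16 when the weights are fp16):

```python
import torch
from torch import nn

class T5LayerNorm(nn.Module):
    """RMS-style layer norm used by T5 (no mean subtraction, no bias)."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        # The variance is accumulated in fp32, so the reduction itself is safe;
        # the problem is fp16 hidden_states that have already overflowed to inf.
        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)

        # Proposed fix: pull +/-inf entries back into the finite range of the
        # input dtype before the multiply, so that nan can no longer appear.
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)

        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
```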
BTW, it happens in the last layers (this might explain why it wasn't caught by the earlier fix).
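To make the failure mode concrete, here is a tiny standalone repro of the arithmetic (assuming the hidden states have already overflowed to inf somewhere in the fp16 forward pass):

```python
import torch

# hidden states where one entry has already overflowed in fp16
h = torch.tensor([float("inf"), 1.0], dtype=torch.float16)

variance = h.to(torch.float32).pow(2).mean(-1, keepdim=True)  # -> inf
print(h * torch.rsqrt(variance + 1e-6))   # inf * 0 -> tensor([nan, 0.])

# with the proposed clamp, everything stays finite
clamp_value = torch.finfo(h.dtype).max - 1000  # 65504 - 1000 for fp16
h = torch.clamp(h, min=-clamp_value, max=clamp_value)
variance = h.to(torch.float32).pow(2).mean(-1, keepdim=True)
print(h * torch.rsqrt(variance + 1e-6))   # finite, no nan
```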