Feature request
The DataCollatorForLanguageModeling class allows training for an MLM (masked language modeling) task, which randomly masks or replaces certain tokens. Models such as BERT and RoBERTa are trained in this manner. It would be great if the user could set a seed, ensuring repeatability when generating masked batches.
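For context, here is a minimal sketch of how repeatability works today: the masking depends on the global RNG state, so the whole script's randomness has to be pinned with transformers.set_seed(). The checkpoint and masking probability below are just illustrative defaults.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, set_seed

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Today, repeatable masking requires pinning the *global* seed first.
set_seed(42)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Which tokens get replaced by [MASK] depends on the global RNG state,
# so any randomness consumed before this call changes the result.
batch = collator([tokenizer("Hello world, this is a test.")])
print(batch["input_ids"])
```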
Motivation
This would ensure generation of repeatable batches of data, which is critical for model reproducibility. Right now, there is a form of repeatability via transformers.set_seed(), but one can make use of generators (PyTorch, TensorFlow, NumPy) to seed the data collator without globally setting the seed for each framework. The major benefit is that the MLM masking would no longer be influenced by randomness consumed elsewhere in the code, which is good practice. Given the same dataset and seed, the masking would then be consistent irrespective of the rest of the training script. See this blog post for more details.
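As a small sketch of the underlying idea for the PyTorch case (the shapes and probability here are just for illustration), a dedicated torch.Generator isolates the masking randomness from the global RNG:

```python
import torch

# A dedicated generator keeps the masking RNG isolated from global state.
gen = torch.Generator().manual_seed(42)

# Draw Bernoulli masking decisions from the local generator.
probability_matrix = torch.full((1, 8), 0.15)
masked_indices = torch.bernoulli(probability_matrix, generator=gen).bool()

# Consuming the *global* RNG elsewhere does not perturb the result:
_ = torch.rand(100)
gen2 = torch.Generator().manual_seed(42)
assert torch.equal(
    masked_indices,
    torch.bernoulli(torch.full((1, 8), 0.15), generator=gen2).bool(),
)
```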
Your contribution
I can submit a PR for this. I have experience with TF, PyTorch, and NumPy, and would love to contribute. I have taken a look at the code and can add a seed argument that enables the use of generators for repeatability. If not specified, the code would fall back to its previous behavior, including respecting transformers.set_seed(). A rough sketch of what the interface might look like is below; the seed argument is the proposal, not part of the current API.
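```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Proposed (hypothetical) argument: `seed` would initialize an internal,
# framework-appropriate generator used only for MLM masking.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    seed=42,
)

# Omitting `seed` would preserve today's behavior, where masking follows
# the global RNG state (e.g., as set by transformers.set_seed()).
```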