Feature request
The DataCollatorForLanguageModeling class allows training for an MLM (masked language modeling) task, which randomly masks or replaces certain tokens. Models such as BERT and RoBERTa are trained in this manner. It would be great if the user could set a seed, ensuring repeatability when generating masked batches.
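For context, here is a minimal sketch of how repeatability works today: the masking depends on the global RNG state, so the whole script's randomness has to be pinned with transformers.set_seed(). The checkpoint and masking probability below are just illustrative defaults.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, set_seed

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Today, repeatable masking requires pinning the *global* seed first.
set_seed(42)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Which tokens get replaced by [MASK] depends on the global RNG state,
# so any randomness consumed before this call changes the result.
batch = collator([tokenizer("Hello world, this is a test.")])
print(batch["input_ids"])
```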
Motivation
This would ensure generation of repeatable batches of data, which is critical for model reproducibility. Right now, there is a form of repeatability via transformers.set_seed(), but one can make use of generators (PyTorch, TensorFlow, NumPy) to seed the data collator without globally setting the seed for each framework. The major benefit is that the MLM masking would no longer be influenced by randomness consumed elsewhere in the code, which is good practice. Given the same dataset and seed, the masking would then be consistent irrespective of the rest of the training script. See this blog post for more details.
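As a small sketch of the underlying idea for the PyTorch case (the shapes and probability here are just for illustration), a dedicated torch.Generator isolates the masking randomness from the global RNG:

```python
import torch

# A dedicated generator keeps the masking RNG isolated from global state.
gen = torch.Generator().manual_seed(42)

# Draw Bernoulli masking decisions from the local generator.
probability_matrix = torch.full((1, 8), 0.15)
masked_indices = torch.bernoulli(probability_matrix, generator=gen).bool()

# Consuming the *global* RNG elsewhere does not perturb the result:
_ = torch.rand(100)
gen2 = torch.Generator().manual_seed(42)
assert torch.equal(
    masked_indices,
    torch.bernoulli(torch.full((1, 8), 0.15), generator=gen2).bool(),
)
```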
Your contribution
I can submit a PR for this. I have experience with TF, PyTorch, and NumPy, and would love to contribute. I have taken a look at the code and can add a seed argument that enables the use of generators for repeatability. If not specified, the code would fall back to its previous behavior, including respecting transformers.set_seed(). A rough sketch of what the interface might look like is below; the seed argument is the proposal, not part of the current API.
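```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Proposed (hypothetical) argument: `seed` would initialize an internal,
# framework-appropriate generator used only for MLM masking.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    seed=42,
)

# Omitting `seed` would preserve today's behavior, where masking follows
# the global RNG state (e.g., as set by transformers.set_seed()).
```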