
Allow setting a seed for DataCollatorForLanguageModeling #36357

@capemox

Description

Feature request

The DataCollatorForLanguageModeling class supports training for an MLM (masked language modeling) task, in which certain tokens are randomly masked or replaced. Models such as BERT and RoBERTa are trained this way. It would be great if the user could set a seed, ensuring repeatability when generating masked batches.

Motivation

This would ensure generation of repeatable batches of data, which is critical for model reproducibility. Right now, there is a form of repeatability via transformers.set_seed(), but one can instead use generators (PyTorch, TensorFlow, NumPy) to seed the data collator without globally setting the seed for each framework. The main benefit is that the MLM masking probabilities would not be influenced by code outside the collator, which is good practice. Given the same dataset and seed, masking would then happen consistently irrespective of the rest of your training script. See this blog post for more details.
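To illustrate the point about local generators, here is a minimal sketch (not the actual transformers implementation; `mask_tokens` and its parameters are hypothetical) showing that masking driven by a dedicated NumPy Generator is reproducible even when the global NumPy RNG state changes between calls:

```python
import numpy as np

def mask_tokens(tokens, mlm_probability=0.15, seed=None):
    # A dedicated Generator isolates the masking randomness from
    # any global np.random state set elsewhere in the script.
    rng = np.random.default_rng(seed)
    mask = rng.random(len(tokens)) < mlm_probability
    return ["[MASK]" if m else t for t, m in zip(tokens, mask)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]

np.random.seed(0)                       # global state set...
first = mask_tokens(tokens, seed=42)
np.random.seed(123)                     # ...then changed
second = mask_tokens(tokens, seed=42)

# Same seed -> identical masking, regardless of global RNG state.
assert first == second
```

With transformers.set_seed() alone, any other code consuming random numbers between batches would shift the masking; a per-collator generator removes that coupling.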

Your contribution

I can submit a PR for this. I have experience with TF, PyTorch, and NumPy, and would love to contribute. I have taken a look at the code and can add a seed argument that enables the use of generators for repeatability. If the seed is not specified, the code would fall back to its previous behavior, including respecting transformers.set_seed().
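A rough sketch of the proposed API shape, assuming a hypothetical `SeededMaskingCollator` class (not the real DataCollatorForLanguageModeling): an optional `seed` drives a dedicated generator, and omitting it falls back to the global RNG, preserving today's set_seed()-based behavior:

```python
import numpy as np

class SeededMaskingCollator:
    """Hypothetical collator sketch with an optional masking seed."""

    def __init__(self, mlm_probability=0.15, seed=None):
        self.mlm_probability = mlm_probability
        # With a seed, a dedicated Generator drives the masking;
        # with None, we keep the legacy globally-seeded behavior.
        self.generator = np.random.default_rng(seed) if seed is not None else None

    def _rand(self, n):
        if self.generator is not None:
            return self.generator.random(n)
        return np.random.random(n)  # fall back to global NumPy state

    def __call__(self, tokens):
        mask = self._rand(len(tokens)) < self.mlm_probability
        return ["[MASK]" if m else t for t, m in zip(tokens, mask)]
```

Two collators constructed with the same seed would then produce identical masked batches, while unseeded collators behave exactly as before.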
