
😷 Fix SFT masking EOS when equal to PAD #3200


Merged
merged 17 commits into main on Apr 2, 2025

Conversation

@qgallouedec (Member) commented Mar 31, 2025

What does this PR do?

Fixes:

This PR fixes the bug, observed many times, in which the SFT model seems to have unlearned how to generate the EOS token. It comes from the masking logic here:

labels = batch["input_ids"].clone()
if self.tokenizer.pad_token_id is not None:
    labels[labels == self.tokenizer.pad_token_id] = -100

So, if EOS == PAD, then every EOS token is masked in the loss.
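
As a toy illustration (made-up token ids, not taken from the PR), with pad_token_id == eos_token_id == 2:

import torch

input_ids = torch.tensor([[5, 6, 7, 2, 2, 2]])  # the 2 at position 3 is a real EOS; the rest is padding
labels = input_ids.clone()
labels[labels == 2] = -100  # the masking logic above
print(labels)  # tensor([[   5,    6,    7, -100, -100, -100]]) -> the real EOS is masked as well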

To solve the problem, we adopt a method that doesn't rely on the token value to determine whether it should be masked in the loss.

This is based on the addition of our own Collator.
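
For illustration, here is a minimal sketch of the idea (not necessarily the exact collator added in this PR): labels are built from the unpadded sequences, and only the positions the collator itself pads are set to -100, so no comparison against the pad token id is ever made.

import torch
from torch.nn.utils.rnn import pad_sequence


class PositionBasedCollatorSketch:
    """Hypothetical collator: masks by padding position, not by token value."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def __call__(self, examples):
        input_ids = [torch.tensor(ex["input_ids"]) for ex in examples]
        attention_mask = [torch.ones_like(ids) for ids in input_ids]
        labels = [ids.clone() for ids in input_ids]  # full sequences, EOS included
        return {
            "input_ids": pad_sequence(input_ids, batch_first=True, padding_value=self.pad_token_id),
            "attention_mask": pad_sequence(attention_mask, batch_first=True, padding_value=0),
            # Only the padding added here gets -100, even if pad_token_id == eos_token_id
            "labels": pad_sequence(labels, batch_first=True, padding_value=-100),
        }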

Why did I choose to add a new collator?

On the one hand, to fix the issue; on the other, to prepare for the future, as it gives us more control to:

  1. natively support multi-modal data, as we already do in DPO, see here:

     if "pixel_values" in examples[0]:
         pixel_values = [torch.tensor(example["pixel_values"]) for example in examples]
     if "pixel_attention_mask" in examples[0]:
         pixel_attention_mask = [torch.tensor(example["pixel_attention_mask"]) for example in examples]
     if "ref_chosen_logps" in examples[0] and "ref_rejected_logps" in examples[0]:
         ref_chosen_logps = torch.tensor([example["ref_chosen_logps"] for example in examples])
         ref_rejected_logps = torch.tensor([example["ref_rejected_logps"] for example in examples])

  2. train on completion only, without relying on a kind of reverse chat template (see the sketch below).
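
To illustrate point 2, a sketch of completion-only masking, assuming the dataset preparation step produces a hypothetical per-token completion_mask column (1 for completion tokens, 0 for prompt tokens); this is not the PR's actual implementation:

import torch


def mask_non_completion_tokens(labels: torch.Tensor, completion_mask: torch.Tensor) -> torch.Tensor:
    # Prompt positions get -100 so the loss ignores them, with no token-value
    # comparison and no reverse chat template.
    return labels.masked_fill(completion_mask == 0, -100)


labels = torch.tensor([[11, 12, 13, 21, 22, 2]])  # made-up ids: 3 prompt tokens + 3 completion tokens
completion_mask = torch.tensor([[0, 0, 0, 1, 1, 1]])
print(mask_non_completion_tokens(labels, completion_mask))  # tensor([[-100, -100, -100,   21,   22,    2]])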

@qgallouedec (Member, Author):

Experiments:

The following code was run on both the main branch and this branch, with and without packing.

from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from accelerate import PartialState
from transformers import AutoTokenizer


def main():
    dataset = load_dataset("trl-lib/Capybara", split="train")
    model_id = "meta-llama/Llama-3.2-3B"

    def func(example):
        messages = example["messages"]
        messages = [f"{message['role']}: {message['content']}" for message in messages]
        text = "\n".join(messages)
        return {"text": text}

    with PartialState().main_process_first():
        dataset = dataset.map(func, remove_columns=dataset.column_names)

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # The bug occurs when the pad token is set to the eos token.
    # We intentionally keep this line to verify that the fix works.
    tokenizer.pad_token = tokenizer.eos_token

    trainer = SFTTrainer(
        model=model_id,
        args=SFTConfig(
            output_dir="Llama-3.2-3B-556-2-fix-pack",
            max_length=4096,
            gradient_checkpointing=True,
            per_device_train_batch_size=4,
            logging_steps=5,
            save_steps=20,
            bf16=True,
            dataset_num_proc=16,
            num_train_epochs=1,
            packing=True,
        ),
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()


if __name__ == "__main__":
    main()
Launched with: accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml sandbox/3200.py

The learning curves match, as expected:

No packing: (screenshot of the training loss curves)

Packing: (screenshot of the training loss curves)

The length distribution after training, which validates that the bug is fixed: (completion_lengths plot)

@qgallouedec qgallouedec marked this pull request as ready for review April 1, 2025 18:05

@qgallouedec changed the title from "Fix eos sft" to "😷 Fix SFT masking EOS when equal to PAD" on Apr 1, 2025
@@ -106,7 +106,6 @@ def test_sft_trainer_transformers(self, model_name, packing):

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token if tokenizer.pad_token is None else tokenizer.pad_token
Member Author (qgallouedec):

we don't need this anymore

Member:

I will do the same in our SFT script in open-r1 once this is merged

Comment on lines -172 to -184
# Model
if args.model_init_kwargs is not None and not isinstance(model, str):
    warnings.warn(
        "You passed model_init_kwargs to the `SFTConfig`, but your model is already instantiated. "
        "The `model_init_kwargs` will be ignored."
    )
if isinstance(model, str):
    model = self._create_model_from_path(model, args)

# PEFT configuration and model wrapping
if peft_config is not None:
    model = self._prepare_peft_model(model, peft_config, args)

Member Author (qgallouedec):

This is moved down so that the user doesn't have to wait for the model to be loaded before getting an error if the pad token is not correctly specified.

Comment on lines -188 to -207
if processing_class.pad_token is None:
    processing_class.pad_token = processing_class.eos_token  # required for padding when collating data

# Dataset
preprocess_dataset = args.dataset_kwargs is None or not args.dataset_kwargs.get("skip_prepare_dataset", False)
if preprocess_dataset:
    train_dataset = self._prepare_dataset(
        train_dataset, processing_class, args, args.packing, formatting_func, "train"
    )
    if eval_dataset is not None:
        packing = args.packing if args.eval_packing is None else args.eval_packing
        if isinstance(eval_dataset, dict):
            eval_dataset = {
                key: self._prepare_dataset(dataset, processing_class, args, packing, formatting_func, key)
                for key, dataset in eval_dataset.items()
            }
        else:
            eval_dataset = self._prepare_dataset(
                eval_dataset, processing_class, args, packing, formatting_func, "eval"
            )
Member Author (qgallouedec):

This is also moved down so that the user doesn't have to wait for the dataset to be processed to get an error if the pad token is not correctly specified.

@lewtun (Member) left a comment:

Great detective work getting to the bottom of this @qgallouedec! Logic LGTM, with a question about what happens if the user provides a pad_token string that splits into multiple token IDs.


# Get the pad token: if not provided, use the one from the processing class or the eos token
# if the processing class does not have a pad token.
pad_token = args.pad_token or processing_class.pad_token or processing_class.eos_token
pad_token_id = processing_class.convert_tokens_to_ids(pad_token)
Member:

What happens here if pad_token is not a single token in the vocab? I.e., if the user passes "hello" and convert_tokens_to_ids gives 2 token IDs?

@qgallouedec (Member, Author) commented Apr 2, 2025:

In that case, processing_class.convert_tokens_to_ids(pad_token) returns None, and this exception is raised:

if pad_token_id is None:
    raise ValueError(
        f"The specified `pad_token` ('{pad_token}') is not found in the vocabulary of the given "
        f"`processing_class` ({processing_class.__class__.__name__}). Ensure that the `pad_token` exists "
        "in the vocabulary before using it as a padding token."
    )
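
For reference, a quick standalone check of this behaviour (a sketch; it assumes the Qwen2.5 tokenizer used in the example below, which defines no unk token, so looking up a string that is not a single token in the vocabulary returns None):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
print(tok.convert_tokens_to_ids("<|endoftext|>"))                        # an existing token -> its integer id
print(tok.convert_tokens_to_ids("this is a bit long for a pad token"))   # not in the vocab -> None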

Member Author (qgallouedec):

>>> from trl import SFTTrainer, SFTConfig
>>> from datasets import load_dataset
>>> dataset = load_dataset("trl-lib/Capybara", split="train")
>>> trainer = SFTTrainer(
...     model="Qwen/Qwen2.5-0.5B",
...     args=SFTConfig(pad_token="this is a bit long for a pad token"),
...     train_dataset=dataset,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fsx/qgallouedec/trl/trl/trainer/sft_trainer.py", line 263, in __init__
    raise ValueError(
ValueError: The specified `pad_token` ('this is a bit long for a pad token') is not found in the vocabulary of the given `processing_class` (Qwen2TokenizerFast). Ensure that the `pad_token` exists in the vocabulary before using it as a padding token.

@qgallouedec qgallouedec merged commit 485852c into main Apr 2, 2025
8 of 10 checks passed
@qgallouedec qgallouedec deleted the fix-eos-sft branch April 2, 2025 15:56
yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025