🏁 Fix adding special tokens in SFT #3328

qgallouedec · 2025-04-20T18:17:22Z

To simplify, this is what we have been doing:

if is_conversational(example):
    example["text"] = tokenizer.apply_chat_template(example["messages"])

processed = processing_class(text=example["text"])
if processed["input_ids"][-1] != processing_class.eos_token_id:
    processed["input_ids"] = processed["input_ids"] + [processing_class.eos_token_id]
    processed["attention_mask"] = processed["attention_mask"] + [1]

There are two issues:

It could add the special tokens (mostly BOS) twice
It could add the EOS twice if the EOS isn't in the end, see SFTTrainer._prepare_dataset() adds an extra eos_token for Qwen2.5 #3318

This is what we should be doing:

if is_conversational(example):
    example["text"] = tokenizer.apply_chat_template(example["messages"])
    add_special_tokens = False
else:
    if not processed["text"].endswith(processing_class.eos_token):  # almost always true, unless the data contains the eos, which is a bad practice
        processed["text"] = processed["text"] + processing_class.eos_token
    add_special_tokens = True

processed = processing_class(text=example["text"], add_special_tokens=add_special_tokens)

Some examples

from trl import SFTTrainer
from datasets import Dataset

dataset = Dataset.from_dict(
    {
        "messages": [
            [{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}]
        ]
    }
)

trainer = SFTTrainer(model="Qwen/Qwen2.5-0.5B-Instruct", train_dataset=dataset)
toknizer = trainer.tokenizer
sample = trainer.train_dataset[0]
print(repr(toknizer.decode(sample["input_ids"])))

It solves both issues

Before/After with model = "Qwen/Qwen2.5-0.5B-Instruct"

- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n<|im_end|>'
+ '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n'

Before/after with model="CohereLabs/aya-expanse-8b"

- '<BOS_TOKEN><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'
+            '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'

And no other changes introduced

trl/trainer/sft_trainer.py

qgallouedec · 2025-04-20T22:52:39Z

Conversational

from trl import SFTTrainer
from datasets import Dataset

dataset = Dataset.from_dict(
    {
        "messages": [
            [{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}]
        ]
    }
)

models = [
    "trl-internal-testing/tiny-CohereForCausalLM",
    "trl-internal-testing/tiny-DbrxForCausalLM",
    "trl-internal-testing/tiny-FalconMambaForCausalLM",
    "trl-internal-testing/tiny-Gemma2ForCausalLM",
    "trl-internal-testing/tiny-GemmaForCausalLM",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.1",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.2",
    "trl-internal-testing/tiny-LlamaForCausalLM-3",
    "trl-internal-testing/tiny-MistralForCausalLM-0.1",
    "trl-internal-testing/tiny-MistralForCausalLM-0.2",
    "trl-internal-testing/tiny-Phi3ForCausalLM",
    "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
]

for model in models:
    trainer = SFTTrainer(model=model, train_dataset=dataset)
    toknizer = trainer.processing_class
    sample = trainer.train_dataset[0]["input_ids"]
    print()
    print(model)
    print(repr(toknizer.decode(sample)))

  trl-internal-testing/tiny-CohereForCausalLM
- '<BOS_TOKEN><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'
+ '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'

  trl-internal-testing/tiny-DbrxForCausalLM
- "<|im_start|>system\nYou are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.\nYOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.\nYou assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).\n(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)\nThis is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.\nYOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|><|endoftext|>"
+ "<|im_start|>system\nYou are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.\nYOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.\nYou assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).\n(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)\nThis is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.\nYOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>"

  trl-internal-testing/tiny-FalconMambaForCausalLM
- '\n\nUser: What is better than ugly?\n\nAssistant: Beautiful.<|endoftext|>'
+ '\n\nUser: What is better than ugly?\n\nAssistant: Beautiful.'

  trl-internal-testing/tiny-Gemma2ForCausalLM
- '<bos><bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n<eos>'
+ '<bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n'

  trl-internal-testing/tiny-GemmaForCausalLM
- '<bos><bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n<eos>'
+ '<bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n'

  trl-internal-testing/tiny-LlamaForCausalLM-3.1
- '<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
+ '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'

  trl-internal-testing/tiny-LlamaForCausalLM-3.2
- '<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 20 Apr 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
+ '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 20 Apr 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'

  trl-internal-testing/tiny-LlamaForCausalLM-3
- '<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
+ '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'

  trl-internal-testing/tiny-MistralForCausalLM-0.1
- '<s><s> [INST] What is better than ugly? [/INST] Beautiful.</s>'
+ '<s> [INST] What is better than ugly? [/INST] Beautiful.</s>'

  trl-internal-testing/tiny-MistralForCausalLM-0.2
- '<s><s> [INST] What is better than ugly? [/INST] Beautiful.</s>'
+ '<s> [INST] What is better than ugly? [/INST] Beautiful.</s>'

  trl-internal-testing/tiny-Qwen2ForCausalLM-2.5
- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n<|im_end|>'
+ '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n'

Language modelling

from trl import SFTTrainer
from datasets import Dataset

dataset = Dataset.from_dict({"text": ["Beautiful is better than ugly."]})

models = [
    "trl-internal-testing/tiny-CohereForCausalLM",
    "trl-internal-testing/tiny-DbrxForCausalLM",
    "trl-internal-testing/tiny-FalconMambaForCausalLM",
    "trl-internal-testing/tiny-Gemma2ForCausalLM",
    "trl-internal-testing/tiny-GemmaForCausalLM",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.1",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.2",
    "trl-internal-testing/tiny-LlamaForCausalLM-3",
    "trl-internal-testing/tiny-MistralForCausalLM-0.1",
    "trl-internal-testing/tiny-MistralForCausalLM-0.2",
    "trl-internal-testing/tiny-Phi3ForCausalLM",
    "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
]

for model in models:
    trainer = SFTTrainer(model=model, train_dataset=dataset)
    toknizer = trainer.processing_class
    sample = trainer.train_dataset[0]["input_ids"]
    print()
    print(model)
    print(repr(toknizer.decode(sample)))

No diff

HuggingFaceDocBuilderDev · 2025-04-20T23:06:10Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…ce/trl into fix-add_special_tokens

shirinyamani · 2025-04-23T00:04:59Z

Maybe pointing to your relevant blog on tokenizer for context of why add_special_tokens = False shoud be added if is_conversational(example) ?

trl/trainer/sft_trainer.py

qgallouedec · 2025-04-23T00:30:10Z

Maybe pointing to your relevant blog on tokenizer for context of why add_special_tokens = False shoud be added if is_conversational(example) ?

done in c1c9f29

qgallouedec added 4 commits April 20, 2025 18:16

fix add_special_token grpo

545d8b8

same for sft

791935b

completion_only_loss

2f533e0

revert grpo change

7b248f6

qgallouedec commented Apr 20, 2025

View reviewed changes

trl/trainer/sft_trainer.py Show resolved Hide resolved

qgallouedec added 3 commits April 20, 2025 22:03

split prs

96a3d00

split PR 2

51ca90d

comment

f963d70

qgallouedec linked an issue Apr 20, 2025 that may be closed by this pull request

SFTTrainer._prepare_dataset() adds an extra eos_token for Qwen2.5 #3318

Closed

5 tasks

qgallouedec changed the title ~~Fix add special tokens~~ 🏁 Fix adding special tokens in SFT Apr 20, 2025

qgallouedec marked this pull request as ready for review April 20, 2025 22:31

qgallouedec requested review from kashif, edbeeching, lewtun and shirinyamani April 20, 2025 22:32

qgallouedec and others added 2 commits April 20, 2025 23:01

Empty commit to trigger CI

ad4f7d8

Merge branch 'main' into fix-add_special_tokens

aa08e8a

qgallouedec and others added 3 commits April 20, 2025 16:11

Merge branch 'main' into fix-add_special_tokens

73ef586

update test

974e916

Merge branch 'fix-add_special_tokens' of https://github.com/huggingfa…

6d3edbc

…ce/trl into fix-add_special_tokens

shirinyamani approved these changes Apr 23, 2025

View reviewed changes

qgallouedec commented Apr 23, 2025

View reviewed changes

trl/trainer/sft_trainer.py Outdated Show resolved Hide resolved

trl/trainer/sft_trainer.py Outdated Show resolved Hide resolved

Apply suggestions from code review

c1c9f29

qgallouedec added 2 commits April 22, 2025 17:30

Merge branch 'main' into fix-add_special_tokens

858bbbd

⏩ Train on completion only (#3329)

9497527

qgallouedec merged commit 9ee6c3a into main Apr 23, 2025
10 checks passed

qgallouedec deleted the fix-add_special_tokens branch April 23, 2025 00:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

🏁 Fix adding special tokens in SFT #3328

🏁 Fix adding special tokens in SFT #3328

Uh oh!

qgallouedec commented Apr 20, 2025 •

edited

Loading

Uh oh!

Uh oh!

qgallouedec commented Apr 20, 2025 •

edited

Loading

Uh oh!

HuggingFaceDocBuilderDev commented Apr 20, 2025

Uh oh!

shirinyamani commented Apr 23, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

qgallouedec commented Apr 23, 2025

Uh oh!

Uh oh!

Uh oh!

🏁 Fix adding special tokens in SFT #3328

🏁 Fix adding special tokens in SFT #3328

Uh oh!

Conversation

qgallouedec commented Apr 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

To simplify, this is what we have been doing:

This is what we should be doing:

Some examples

It solves both issues

And no other changes introduced

Uh oh!

Uh oh!

qgallouedec commented Apr 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Conversational

Language modelling

Uh oh!

HuggingFaceDocBuilderDev commented Apr 20, 2025

Uh oh!

shirinyamani commented Apr 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

qgallouedec commented Apr 23, 2025

Uh oh!

Uh oh!

Uh oh!

qgallouedec commented Apr 20, 2025 •

edited

Loading

qgallouedec commented Apr 20, 2025 •

edited

Loading

shirinyamani commented Apr 23, 2025 •

edited

Loading