@qgallouedec qgallouedec commented Apr 20, 2025

To simplify, this is what we have been doing:

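if is_conversational(example):
    # Render the conversation to a string with the chat template
    example["text"] = processing_class.apply_chat_template(example["messages"], tokenize=False)

# Tokenize with the default add_special_tokens=True
processed = processing_class(text=example["text"])
# Append EOS whenever the last token is not already EOS
if processed["input_ids"][-1] != processing_class.eos_token_id:
    processed["input_ids"] = processed["input_ids"] + [processing_class.eos_token_id]
    processed["attention_mask"] = processed["attention_mask"] + [1]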

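There are two issues:

1. The chat template may already insert special tokens such as BOS, and tokenizing its output with the default add_special_tokens=True inserts them a second time (see the doubled <BOS_TOKEN> in the examples below).
2. The EOS check only looks at the last token, so when the chat template emits something after its end-of-turn token (for Qwen2.5, a newline after <|im_end|>), an extra EOS is appended.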

This is what we should be doing:

if is_conversational(example):
    # The chat template already inserts the special tokens it needs
    example["text"] = processing_class.apply_chat_template(example["messages"], tokenize=False)
    add_special_tokens = False
else:
    # Almost always true, unless the data already contains the EOS token, which is bad practice
    if not example["text"].endswith(processing_class.eos_token):
        example["text"] = example["text"] + processing_class.eos_token
    add_special_tokens = True

processed = processing_class(text=example["text"], add_special_tokens=add_special_tokens)
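
The branching above can be exercised without downloading a model. What follows is a minimal sketch under stated assumptions: `ToyTokenizer`, its `<bos>`/`<eos>` strings, the simplified `is_conversational`, and the `tokenize_example` helper are all hypothetical stand-ins for illustration, not the TRL or transformers implementation.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer (NOT the transformers API)."""

    bos_token = "<bos>"
    eos_token = "<eos>"

    def apply_chat_template(self, messages):
        # Like many real templates, this one emits BOS itself and closes each turn with EOS.
        body = "".join(f"{m['role']}: {m['content']}{self.eos_token}" for m in messages)
        return self.bos_token + body

    def __call__(self, text, add_special_tokens=True):
        # Special tokens stay atomic; everything between them is one chunk.
        tokens = [t for t in re.split(r"(<bos>|<eos>)", text) if t]
        if add_special_tokens:
            tokens = [self.bos_token] + tokens
        return {"input_ids": tokens}


def is_conversational(example):
    # Simplified assumption: conversational examples carry a "messages" key.
    return "messages" in example


def tokenize_example(example, tok):
    if is_conversational(example):
        # The template already added the special tokens, so do not add them again.
        text = tok.apply_chat_template(example["messages"])
        add_special_tokens = False
    else:
        # Plain text: append EOS to the string, let the tokenizer add BOS.
        text = example["text"]
        if not text.endswith(tok.eos_token):
            text = text + tok.eos_token
        add_special_tokens = True
    return tok(text, add_special_tokens=add_special_tokens)


tok = ToyTokenizer()
conversational = tokenize_example({"messages": [{"role": "user", "content": "hi"}]}, tok)
plain = tokenize_example({"text": "hello"}, tok)
print(conversational["input_ids"])
print(plain["input_ids"])
```

In this toy setting both branches end up with exactly one BOS and one EOS, which is the property the change is after.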

Some examples

from trl import SFTTrainer
from datasets import Dataset

dataset = Dataset.from_dict(
    {
        "messages": [
            [{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}]
        ]
    }
)

trainer = SFTTrainer(model="Qwen/Qwen2.5-0.5B-Instruct", train_dataset=dataset)
tokenizer = trainer.processing_class
sample = trainer.train_dataset[0]
print(repr(tokenizer.decode(sample["input_ids"])))

It solves both issues

Before/after with model="Qwen/Qwen2.5-0.5B-Instruct"

- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n<|im_end|>'
+ '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n'
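
The extra `<|im_end|>` in the "before" line follows directly from the old last-token check. As a rough illustration, assuming a hand-written token list rather than real Qwen token ids (`old_append_eos` and `templated_tail` are hypothetical names), the check fires because the template ends with a newline, not with EOS:

```python
EOS = "<|im_end|>"


def old_append_eos(tokens, eos=EOS):
    """Old behaviour: append EOS whenever the last token is not EOS."""
    if tokens and tokens[-1] == eos:
        return tokens
    return tokens + [eos]


# Tail of the templated conversation: the template closes the turn with EOS + "\n".
templated_tail = ["Beautiful", ".", EOS, "\n"]

result = old_append_eos(templated_tail)
print(result[-3:])  # ['<|im_end|>', '\n', '<|im_end|>']
```

Skipping the append entirely for conversational data, as proposed above, avoids having to guess what a template puts after its end-of-turn token.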

Before/after with model="CohereLabs/aya-expanse-8b"

- '<BOS_TOKEN><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'
+ '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'

No other changes are introduced.

@qgallouedec qgallouedec linked an issue Apr 20, 2025 that may be closed by this pull request
@qgallouedec qgallouedec changed the title Fix add special tokens 🏁 Fix adding special tokens in SFT Apr 20, 2025
@qgallouedec qgallouedec marked this pull request as ready for review April 20, 2025 22:31
@qgallouedec
Member Author

qgallouedec commented Apr 20, 2025

More in-depth comparison:

Conversational

from trl import SFTTrainer
from datasets import Dataset

dataset = Dataset.from_dict(
    {
        "messages": [
            [{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}]
        ]
    }
)

models = [
    "trl-internal-testing/tiny-CohereForCausalLM",
    "trl-internal-testing/tiny-DbrxForCausalLM",
    "trl-internal-testing/tiny-FalconMambaForCausalLM",
    "trl-internal-testing/tiny-Gemma2ForCausalLM",
    "trl-internal-testing/tiny-GemmaForCausalLM",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.1",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.2",
    "trl-internal-testing/tiny-LlamaForCausalLM-3",
    "trl-internal-testing/tiny-MistralForCausalLM-0.1",
    "trl-internal-testing/tiny-MistralForCausalLM-0.2",
    "trl-internal-testing/tiny-Phi3ForCausalLM",
    "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
]

for model in models:
    trainer = SFTTrainer(model=model, train_dataset=dataset)
    tokenizer = trainer.processing_class
    sample = trainer.train_dataset[0]["input_ids"]
    print()
    print(model)
    print(repr(tokenizer.decode(sample)))

Before/after:

  trl-internal-testing/tiny-CohereForCausalLM
- '<BOS_TOKEN><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'
+ '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'

  trl-internal-testing/tiny-DbrxForCausalLM
- "<|im_start|>system\nYou are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.\nYOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.\nYou assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).\n(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)\nThis is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.\nYOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|><|endoftext|>"
+ "<|im_start|>system\nYou are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.\nYOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.\nYou assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).\n(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)\nThis is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.\nYOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>"

  trl-internal-testing/tiny-FalconMambaForCausalLM
- '\n\nUser: What is better than ugly?\n\nAssistant: Beautiful.<|endoftext|>'
+ '\n\nUser: What is better than ugly?\n\nAssistant: Beautiful.'

  trl-internal-testing/tiny-Gemma2ForCausalLM
- '<bos><bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n<eos>'
+ '<bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n'

  trl-internal-testing/tiny-GemmaForCausalLM
- '<bos><bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n<eos>'
+ '<bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n'

  trl-internal-testing/tiny-LlamaForCausalLM-3.1
- '<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
+ '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'

  trl-internal-testing/tiny-LlamaForCausalLM-3.2
- '<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 20 Apr 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
+ '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 20 Apr 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'

  trl-internal-testing/tiny-LlamaForCausalLM-3
- '<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
+ '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'

  trl-internal-testing/tiny-MistralForCausalLM-0.1
- '<s><s> [INST] What is better than ugly? [/INST] Beautiful.</s>'
+ '<s> [INST] What is better than ugly? [/INST] Beautiful.</s>'

  trl-internal-testing/tiny-MistralForCausalLM-0.2
- '<s><s> [INST] What is better than ugly? [/INST] Beautiful.</s>'
+ '<s> [INST] What is better than ugly? [/INST] Beautiful.</s>'

  trl-internal-testing/tiny-Qwen2ForCausalLM-2.5
- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n<|im_end|>'
+ '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n'
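
Checking decoded samples like these by eye gets tedious across many models. Below is a small sketch of a string-level check; `audit_special_tokens` is a hypothetical helper, not part of TRL, and it only detects a doubled leading BOS and an EOS repeated at the very end (optionally separated by whitespace). It assumes you pass the tokenizer's own BOS/EOS strings.

```python
def audit_special_tokens(decoded, bos_token="", eos_token=""):
    """Flag a doubled leading BOS and a repeated trailing EOS in a decoded sample."""
    duplicate_bos = bool(bos_token) and decoded.startswith(bos_token + bos_token)

    repeated_eos = False
    if eos_token and decoded.endswith(eos_token):
        # Drop the final EOS and any whitespace before it, then look for another EOS.
        head = decoded[: -len(eos_token)].rstrip()
        repeated_eos = head.endswith(eos_token)

    return {"duplicate_bos": duplicate_bos, "repeated_eos": repeated_eos}


# Strings taken from the Mistral and Qwen2.5 comparisons above (Qwen tail only).
mistral_before = "<s><s> [INST] What is better than ugly? [/INST] Beautiful.</s>"
mistral_after = "<s> [INST] What is better than ugly? [/INST] Beautiful.</s>"
qwen_before = "<|im_start|>assistant\nBeautiful.<|im_end|>\n<|im_end|>"
qwen_after = "<|im_start|>assistant\nBeautiful.<|im_end|>\n"

print(audit_special_tokens(mistral_before, "<s>", "</s>"))
print(audit_special_tokens(mistral_after, "<s>", "</s>"))
print(audit_special_tokens(qwen_before, eos_token="<|im_end|>"))
print(audit_special_tokens(qwen_after, eos_token="<|im_end|>"))
```

A check on token ids would be stricter than this string comparison, but the string form matches what the decode calls in the script print.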

Language modelling

from trl import SFTTrainer
from datasets import Dataset

dataset = Dataset.from_dict({"text": ["Beautiful is better than ugly."]})

models = [
    "trl-internal-testing/tiny-CohereForCausalLM",
    "trl-internal-testing/tiny-DbrxForCausalLM",
    "trl-internal-testing/tiny-FalconMambaForCausalLM",
    "trl-internal-testing/tiny-Gemma2ForCausalLM",
    "trl-internal-testing/tiny-GemmaForCausalLM",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.1",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.2",
    "trl-internal-testing/tiny-LlamaForCausalLM-3",
    "trl-internal-testing/tiny-MistralForCausalLM-0.1",
    "trl-internal-testing/tiny-MistralForCausalLM-0.2",
    "trl-internal-testing/tiny-Phi3ForCausalLM",
    "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
]

for model in models:
    trainer = SFTTrainer(model=model, train_dataset=dataset)
    tokenizer = trainer.processing_class
    sample = trainer.train_dataset[0]["input_ids"]
    print()
    print(model)
    print(repr(tokenizer.decode(sample)))

No diff: the output is identical before and after for every model.
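
One way to see why plain text is unaffected: appending the EOS id after tokenizing (old path) and appending the EOS string before tokenizing (new path) give the same sequence whenever the tokenizer maps the EOS string to a single token. The sketch below assumes a toy whitespace tokenizer; `toy_tokenize`, `old_path`, `new_path`, and the `<s>`/`</s>` strings are hypothetical and only illustrate the argument.

```python
import re

BOS, EOS = "<s>", "</s>"


def toy_tokenize(text, add_special_tokens=True):
    """Hypothetical tokenizer: special tokens are atomic, other text splits on whitespace."""
    tokens = [t for t in re.split(r"(<s>|</s>)|\s+", text) if t]
    return ([BOS] if add_special_tokens else []) + tokens


def old_path(text):
    # Tokenize first, then append EOS if the last token is not EOS.
    tokens = toy_tokenize(text)
    if tokens[-1] != EOS:
        tokens = tokens + [EOS]
    return tokens


def new_path(text):
    # Append the EOS string first, then tokenize with special tokens enabled.
    if not text.endswith(EOS):
        text = text + EOS
    return toy_tokenize(text, add_special_tokens=True)


text = "Beautiful is better than ugly."
print(old_path(text))
print(new_path(text))
```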


@shirinyamani
Member

shirinyamani commented Apr 23, 2025

Maybe point to your relevant blog post on tokenizers, for context on why add_special_tokens=False should be set when is_conversational(example)?

@qgallouedec
Member Author

Maybe point to your relevant blog post on tokenizers, for context on why add_special_tokens=False should be set when is_conversational(example)?

done in c1c9f29

@qgallouedec qgallouedec merged commit 9ee6c3a into main Apr 23, 2025
10 checks passed
@qgallouedec qgallouedec deleted the fix-add_special_tokens branch April 23, 2025 00:51
Development

Successfully merging this pull request may close these issues.

SFTTrainer._prepare_dataset() adds an extra eos_token for Qwen2.5