-
Notifications
You must be signed in to change notification settings - Fork 2.2k
🏁 Fix adding special tokens in SFT #3328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
More in-depth comparison: Conversationalfrom trl import SFTTrainer
from datasets import Dataset
dataset = Dataset.from_dict(
{
"messages": [
[{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}]
]
}
)
models = [
"trl-internal-testing/tiny-CohereForCausalLM",
"trl-internal-testing/tiny-DbrxForCausalLM",
"trl-internal-testing/tiny-FalconMambaForCausalLM",
"trl-internal-testing/tiny-Gemma2ForCausalLM",
"trl-internal-testing/tiny-GemmaForCausalLM",
"trl-internal-testing/tiny-LlamaForCausalLM-3.1",
"trl-internal-testing/tiny-LlamaForCausalLM-3.2",
"trl-internal-testing/tiny-LlamaForCausalLM-3",
"trl-internal-testing/tiny-MistralForCausalLM-0.1",
"trl-internal-testing/tiny-MistralForCausalLM-0.2",
"trl-internal-testing/tiny-Phi3ForCausalLM",
"trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
]
for model in models:
trainer = SFTTrainer(model=model, train_dataset=dataset)
toknizer = trainer.processing_class
sample = trainer.train_dataset[0]["input_ids"]
print()
print(model)
print(repr(toknizer.decode(sample))) trl-internal-testing/tiny-CohereForCausalLM
- '<BOS_TOKEN><BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'
+ '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is better than ugly?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Beautiful.<|END_OF_TURN_TOKEN|>'
trl-internal-testing/tiny-DbrxForCausalLM
- "<|im_start|>system\nYou are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.\nYOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.\nYou assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).\n(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)\nThis is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.\nYOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|><|endoftext|>"
+ "<|im_start|>system\nYou are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.\nYOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.\nYou assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).\n(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)\nThis is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.\nYOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>"
trl-internal-testing/tiny-FalconMambaForCausalLM
- '\n\nUser: What is better than ugly?\n\nAssistant: Beautiful.<|endoftext|>'
+ '\n\nUser: What is better than ugly?\n\nAssistant: Beautiful.'
trl-internal-testing/tiny-Gemma2ForCausalLM
- '<bos><bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n<eos>'
+ '<bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n'
trl-internal-testing/tiny-GemmaForCausalLM
- '<bos><bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n<eos>'
+ '<bos><start_of_turn>user\nWhat is better than ugly?<end_of_turn>\n<start_of_turn>model\nBeautiful.<end_of_turn>\n'
trl-internal-testing/tiny-LlamaForCausalLM-3.1
- '<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
+ '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
trl-internal-testing/tiny-LlamaForCausalLM-3.2
- '<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 20 Apr 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
+ '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 20 Apr 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
trl-internal-testing/tiny-LlamaForCausalLM-3
- '<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
+ '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is better than ugly?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBeautiful.<|eot_id|>'
trl-internal-testing/tiny-MistralForCausalLM-0.1
- '<s><s> [INST] What is better than ugly? [/INST] Beautiful.</s>'
+ '<s> [INST] What is better than ugly? [/INST] Beautiful.</s>'
trl-internal-testing/tiny-MistralForCausalLM-0.2
- '<s><s> [INST] What is better than ugly? [/INST] Beautiful.</s>'
+ '<s> [INST] What is better than ugly? [/INST] Beautiful.</s>'
trl-internal-testing/tiny-Qwen2ForCausalLM-2.5
- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n<|im_end|>'
+ '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is better than ugly?<|im_end|>\n<|im_start|>assistant\nBeautiful.<|im_end|>\n' Language modellingfrom trl import SFTTrainer
from datasets import Dataset
dataset = Dataset.from_dict({"text": ["Beautiful is better than ugly."]})
models = [
"trl-internal-testing/tiny-CohereForCausalLM",
"trl-internal-testing/tiny-DbrxForCausalLM",
"trl-internal-testing/tiny-FalconMambaForCausalLM",
"trl-internal-testing/tiny-Gemma2ForCausalLM",
"trl-internal-testing/tiny-GemmaForCausalLM",
"trl-internal-testing/tiny-LlamaForCausalLM-3.1",
"trl-internal-testing/tiny-LlamaForCausalLM-3.2",
"trl-internal-testing/tiny-LlamaForCausalLM-3",
"trl-internal-testing/tiny-MistralForCausalLM-0.1",
"trl-internal-testing/tiny-MistralForCausalLM-0.2",
"trl-internal-testing/tiny-Phi3ForCausalLM",
"trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
]
for model in models:
trainer = SFTTrainer(model=model, train_dataset=dataset)
toknizer = trainer.processing_class
sample = trainer.train_dataset[0]["input_ids"]
print()
print(model)
print(repr(toknizer.decode(sample))) No diff |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Maybe pointing to your relevant blog on tokenizer for context of why |
done in c1c9f29 |
To simplify, this is what we have been doing:
There are two issues:
This is what we should be doing:
Some examples
It solves both issues
Before/After with
model = "Qwen/Qwen2.5-0.5B-Instruct"
Before/after with
model="CohereLabs/aya-expanse-8b"
And no other changes introduced