Skip to content

✋ Prevent applying the chat template to tokenized datasets #2939

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Feb 24, 2025

Conversation

DanFosing
Copy link
Contributor

@DanFosing DanFosing commented Feb 23, 2025

Fixes:

  • SFTTrainer sometimes tried applying chat template to tokenized datasets, not sure why was it happening (there might be some errors in maybe_apply_chat_template and maybe_convert_to_chatml that caused it to happen?), this fix should prevent the code from even considering to do that if the dataset is already tokenized

@kashif
Copy link
Collaborator

kashif commented Feb 23, 2025

@DanFosing which version of TRL are you using?

@DanFosing
Copy link
Contributor Author

I experienced this issue with both v0.15.1 and with the alpha version downloaded using:
pip install git+https://github.com/huggingface/trl.git

@DanFosing
Copy link
Contributor Author

Oh and I forgot to mention, max_seq_length didn't seem to work for me for some reason, the warning says it will be deprecated in v.0.20.0 but are you sure it wasn't deprecated already? (that's why I added a comment there in the code, but it's not related to the main fix)

@kashif
Copy link
Collaborator

kashif commented Feb 24, 2025

@DanFosing ok so kindly remove the max_seq_length from the sft_config.py and move the chat template logic inside the already defined if not is_processed: where then it makes sense... instead of a new if not is_processed: block

@kashif
Copy link
Collaborator

kashif commented Feb 24, 2025

we can fix the warning and say: removed in version 0.16.0

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Copy link
Member

maybe_apply_chat_template applies the chat template if needed. hence "maybe". Are encountering a bug? If so what's the traceback?

@kashif
Copy link
Collaborator

kashif commented Feb 24, 2025

i don't think there is a bug, but please correct me @DanFosing if I am mistaken, the issue is that it's doing this extra work when it's not needed.

@qgallouedec
Copy link
Member

Oh and I forgot to mention, max_seq_length didn't seem to work for me for some reason

WDYM "didn't seem to work"? Same question, is an exception raised? if so what's the traceback?
Have you tried to pull the very last commits? Could be related to #2947

@qgallouedec
Copy link
Member

i don't think there is a bug, but please correct me @DanFosing if I am mistaken, the issue is that it's doing this extra work when it's not needed.

for clarification, the only extra work done is iterating through the dataset:

trl/trl/data_utils.py

Lines 218 to 221 in 5c05913

if is_conversational(example):
return apply_chat_template(example, tokenizer, tools)
else:
return example

which is usually very fast

@qgallouedec
Copy link
Member

That being said, I'm ok to add the if not is_processed: to avoid extra logging/iteration

@qgallouedec
Copy link
Member

@bot /style

Copy link
Contributor

Style fixes have been applied. View the workflow run here.

@qgallouedec qgallouedec changed the title Prevent applying the chat template to tokenized datasets ✋ Prevent applying the chat template to tokenized datasets Feb 24, 2025
@qgallouedec qgallouedec merged commit 4e0cf01 into huggingface:main Feb 24, 2025
kashif added a commit that referenced this pull request Feb 27, 2025
* Update sft_config.py

* Update sft_trainer.py

* Update sft_config.py

* Update sft_trainer.py

* Apply style fixes

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
jhinpan pushed a commit to jhinpan/trl-jin that referenced this pull request Mar 12, 2025
…ce#2939)

* Update sft_config.py

* Update sft_trainer.py

* Update sft_config.py

* Update sft_trainer.py

* Apply style fixes

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025
…ce#2939)

* Update sft_config.py

* Update sft_trainer.py

* Update sft_config.py

* Update sft_trainer.py

* Apply style fixes

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants