✋ Prevent applying the chat template to tokenized datasets #2939

DanFosing · 2025-02-23T18:38:22Z

Fixes:

SFTTrainer sometimes tried applying chat template to tokenized datasets, not sure why was it happening (there might be some errors in maybe_apply_chat_template and maybe_convert_to_chatml that caused it to happen?), this fix should prevent the code from even considering to do that if the dataset is already tokenized

kashif · 2025-02-23T18:42:51Z

@DanFosing which version of TRL are you using?

DanFosing · 2025-02-23T19:02:31Z

I experienced this issue with both v0.15.1 and with the alpha version downloaded using:
pip install git+https://github.com/huggingface/trl.git

DanFosing · 2025-02-23T19:05:41Z

Oh and I forgot to mention, max_seq_length didn't seem to work for me for some reason, the warning says it will be deprecated in v.0.20.0 but are you sure it wasn't deprecated already? (that's why I added a comment there in the code, but it's not related to the main fix)

kashif · 2025-02-24T17:27:26Z

@DanFosing ok so kindly remove the max_seq_length from the sft_config.py and move the chat template logic inside the already defined if not is_processed: where then it makes sense... instead of a new if not is_processed: block

kashif · 2025-02-24T17:39:56Z

we can fix the warning and say: removed in version 0.16.0

HuggingFaceDocBuilderDev · 2025-02-24T17:44:15Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2025-02-24T18:58:21Z

maybe_apply_chat_template applies the chat template if needed. hence "maybe". Are encountering a bug? If so what's the traceback?

kashif · 2025-02-24T19:00:50Z

i don't think there is a bug, but please correct me @DanFosing if I am mistaken, the issue is that it's doing this extra work when it's not needed.

qgallouedec · 2025-02-24T19:01:05Z

Oh and I forgot to mention, max_seq_length didn't seem to work for me for some reason

WDYM "didn't seem to work"? Same question, is an exception raised? if so what's the traceback?
Have you tried to pull the very last commits? Could be related to #2947

qgallouedec · 2025-02-24T19:03:13Z

i don't think there is a bug, but please correct me @DanFosing if I am mistaken, the issue is that it's doing this extra work when it's not needed.

for clarification, the only extra work done is iterating through the dataset:

trl/trl/data_utils.py

Lines 218 to 221 in 5c05913

    
           if is_conversational(example): 
        
               return apply_chat_template(example, tokenizer, tools) 
        
           else: 
        
               return example

which is usually very fast

qgallouedec · 2025-02-24T19:05:34Z

That being said, I'm ok to add the if not is_processed: to avoid extra logging/iteration

qgallouedec · 2025-02-24T20:34:21Z

@bot /style

github-actions · 2025-02-24T20:34:47Z

Style fixes have been applied. View the workflow run here.

* Update sft_config.py * Update sft_trainer.py * Update sft_config.py * Update sft_trainer.py * Apply style fixes --------- Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…ce#2939) * Update sft_config.py * Update sft_trainer.py * Update sft_config.py * Update sft_trainer.py * Apply style fixes --------- Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

DanFosing added 2 commits February 23, 2025 19:16

Update sft_config.py

10d3499

Update sft_trainer.py

f9dd493

Merge branch 'main' into main

696a554

DanFosing added 2 commits February 24, 2025 18:34

Update sft_config.py

55a9753

Update sft_trainer.py

e33be0a

qgallouedec approved these changes Feb 24, 2025

View reviewed changes

Apply style fixes

bf56989

kashif approved these changes Feb 24, 2025

View reviewed changes

qgallouedec changed the title ~~Prevent applying the chat template to tokenized datasets~~ ✋ Prevent applying the chat template to tokenized datasets Feb 24, 2025

qgallouedec merged commit 4e0cf01 into huggingface:main Feb 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

✋ Prevent applying the chat template to tokenized datasets #2939

✋ Prevent applying the chat template to tokenized datasets #2939

Uh oh!

DanFosing commented Feb 23, 2025 •

edited

Loading

Uh oh!

kashif commented Feb 23, 2025

Uh oh!

DanFosing commented Feb 23, 2025

Uh oh!

DanFosing commented Feb 23, 2025

Uh oh!

kashif commented Feb 24, 2025

Uh oh!

kashif commented Feb 24, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

kashif commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

github-actions bot commented Feb 24, 2025

Uh oh!

Uh oh!

✋ Prevent applying the chat template to tokenized datasets #2939

✋ Prevent applying the chat template to tokenized datasets #2939

Uh oh!

Conversation

DanFosing commented Feb 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kashif commented Feb 23, 2025

Uh oh!

DanFosing commented Feb 23, 2025

Uh oh!

DanFosing commented Feb 23, 2025

Uh oh!

kashif commented Feb 24, 2025

Uh oh!

kashif commented Feb 24, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

kashif commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

qgallouedec commented Feb 24, 2025

Uh oh!

github-actions bot commented Feb 24, 2025

Uh oh!

Uh oh!

DanFosing commented Feb 23, 2025 •

edited

Loading