📨 [SFT] Tokenize directly when applying the chat template #3572
## What this PR does

This PR simplifies how we tokenize conversational data. Previously, we applied the chat template to render the conversation as text and then tokenized that text in a separate step. Now, we tokenize directly while applying the chat template.
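The original code snippets for the two flows were not preserved here; as a minimal sketch, the before/after flows look like the following. A toy tokenizer stands in for a real Hugging Face one (the real API is `tokenizer.apply_chat_template(..., tokenize=True/False)`; the `ToyTokenizer` class and its encoding are purely illustrative):

```python
class ToyTokenizer:
    """Toy stand-in for a Hugging Face tokenizer, for illustration only."""

    def apply_chat_template(self, messages, tokenize=False):
        # Render the conversation to a single string.
        text = "".join(f"<{m['role']}> {m['content']} " for m in messages)
        if not tokenize:
            return text
        # New flow: tokenize directly while applying the template.
        return self._encode(text)

    def __call__(self, text):
        # Old flow, step 2: tokenize the already-rendered text.
        return {"input_ids": self._encode(text)}

    def _encode(self, text):
        # Whitespace "tokenization" mapped to toy ids.
        return [hash(tok) % 1000 for tok in text.split()]


messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
tok = ToyTokenizer()

# Before: render to text, then tokenize in a second pass.
text = tok.apply_chat_template(messages, tokenize=False)
old_ids = tok(text)["input_ids"]

# After: tokenize in one pass while applying the template.
new_ids = tok.apply_chat_template(messages, tokenize=True)

assert old_ids == new_ids  # both flows yield the same token ids
```

Collapsing the two steps into one is what makes it possible to ask the tokenizer for extra per-token information (such as an assistant-tokens mask) at templating time.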
## Why

This change enables future support for `return_assistant_tokens_mask`, which is useful for training on assistant-only tokens.

## Impact
- **User-facing:** no changes.
- **Internal:** minimal changes, mostly cosmetic (see table below):
  - `attention_mask` is no longer added by default; it was always filled with 1s and later replaced by the collator.
  - The `"messages"` column is now preserved when `remove_unused_columns=False`, rather than being replaced with `"text"`. This improves clarity.

### Column changes summary
| Branch | Input column | Prepared columns |
| --- | --- | --- |
| `main` | `"text"` | `["input_ids", "attention_mask", "text"]` |
| this PR | `"text"` | `["input_ids", "text"]` |
| `main` | `"messages"` | `["input_ids", "attention_mask", "text"]` |
| this PR | `"messages"` | `["input_ids", "messages"]` |
## Functional Equivalence

All previous dataset preparation workflows remain functionally equivalent. A full equivalence test (including token IDs, `position_ids`, and `completion_mask` where applicable) was run across multiple configurations; no regressions were found.
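For context, the assistant-only training that `return_assistant_tokens_mask` would enable (mentioned under "Why") can be sketched as follows. This is a toy illustration, not TRL's implementation; `build_labels` and `IGNORE_INDEX` are hypothetical names, though `-100` is the conventional ignore index for cross-entropy loss:

```python
# Toy sketch of assistant-only loss masking (not TRL's actual code).
# Each message contributes a span of token ids; only assistant spans
# keep their labels, everything else is set to IGNORE_INDEX so the
# loss skips it.

IGNORE_INDEX = -100

def build_labels(token_spans):
    """token_spans: list of (role, input_ids) pairs in conversation order."""
    input_ids, labels = [], []
    for role, ids in token_spans:
        input_ids.extend(ids)
        labels.extend(ids if role == "assistant" else [IGNORE_INDEX] * len(ids))
    return input_ids, labels

spans = [
    ("user", [5, 6, 7]),
    ("assistant", [8, 9]),
    ("user", [10]),
    ("assistant", [11, 12]),
]
input_ids, labels = build_labels(spans)
# input_ids -> [5, 6, 7, 8, 9, 10, 11, 12]
# labels    -> [-100, -100, -100, 8, 9, -100, 11, 12]
```

A per-token assistant mask produced at templating time gives exactly the role information this sketch takes as input, without re-deriving message boundaries from the rendered text.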