♻️ Fix caching in SFT #2945

qgallouedec · 2025-02-24T09:14:29Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2025-02-24T09:19:15Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

edbeeching

LGTM. Out of curiosity: if I run two experiments with different max_length values (and hence different packed dataset sizes), will both packed datasets be cached independently, so that future experiments reuse the correct cached dataset, or does caching one overwrite the other?

qgallouedec · 2025-02-24T09:33:47Z

> will both packed datasets be cached independently

Yes!

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train[:10%]")

# First run with max_length=128: processes (tokenizes and packs) the dataset
training_args = SFTConfig(output_dir="Qwen/Qwen2.5-0.5B-SFT", max_length=128, packing=True)
trainer = SFTTrainer(
    args=training_args,
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)

# Different max_length (256): processes the dataset as well, cached separately
training_args = SFTConfig(output_dir="Qwen/Qwen2.5-0.5B-SFT", max_length=256, packing=True)
trainer = SFTTrainer(
    args=training_args,
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)

# Same max_length as the first run (128): uses the cache!
training_args = SFTConfig(output_dir="Qwen/Qwen2.5-0.5B-SFT", max_length=128, packing=True)
trainer = SFTTrainer(
    args=training_args,
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)

Fix caching sft

6fe4cd2

qgallouedec requested review from kashif, edbeeching, lewtun and plaguss February 24, 2025 09:16

kashif approved these changes Feb 24, 2025

View reviewed changes

edbeeching approved these changes Feb 24, 2025

View reviewed changes

qgallouedec merged commit 3886147 into main Feb 24, 2025
13 of 14 checks passed

qgallouedec deleted the fix-caching-sft branch February 24, 2025 09:54

qgallouedec added a commit that referenced this pull request Feb 25, 2025

♻️ Fix caching in SFT (#2945)

51d383e

edbeeching mentioned this pull request Feb 26, 2025

is it normal that src/open_r1/sft.py performs tokenizing and packing of the dataset every time I run the script ? huggingface/open-r1#435

Closed

jhinpan pushed a commit to jhinpan/trl-jin that referenced this pull request Mar 12, 2025

♻️ Fix caching in SFT (huggingface#2945)

b91a654

yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025

♻️ Fix caching in SFT (huggingface#2945)

1f4ef57
