
Conversation

BenasdTW
Contributor

What does this PR do?

Fixes #2861

This PR adds back the code in SFTTrainer that properly handles a dataset that has already been preprocessed (tokenized).
This part of the code was removed in #2405, but it should not have been removed.

        # If the dataset is already preprocessed (tokenized), return as-is. Only works if dataset is
        # a datasets.Dataset or datasets.IterableDataset -- not for torch Dataset
        column_names = (
            dataset.column_names if isinstance(dataset, (Dataset, IterableDataset)) else None
        )
        if column_names and "input_ids" in column_names:
            if formatting_func is not None:
                warnings.warn(
                    "You passed a dataset that is already processed (contains an `input_ids` field) together with a "
                    "valid formatting function. Therefore `formatting_func` will be ignored. Either remove the "
                    "`formatting_func` or pass a dataset that is not already processed.",
                    UserWarning,
                )
            if not packing:
                return dataset
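
For illustration, here is a minimal sketch of the use case this restores: training on a dataset that was tokenized ahead of time, with packing disabled. The model id, dataset contents, and output directory below are placeholders, not taken from the PR.

    # Minimal sketch (not from the PR): passing an already-tokenized dataset to SFTTrainer.
    # The model id is a placeholder; any causal LM checkpoint with a tokenizer works.
    from datasets import Dataset
    from transformers import AutoTokenizer
    from trl import SFTConfig, SFTTrainer

    model_id = "Qwen/Qwen2.5-0.5B"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    raw = Dataset.from_dict({"text": ["Hello world.", "TRL accepts pretokenized data again."]})

    # Tokenize outside the trainer. The resulting dataset contains `input_ids`,
    # so SFTTrainer should return it as-is instead of re-processing it.
    tokenized = raw.map(lambda ex: tokenizer(ex["text"]), remove_columns=["text"])

    trainer = SFTTrainer(
        model=model_id,
        args=SFTConfig(output_dir="/tmp/sft-pretokenized", max_steps=1, report_to="none"),
        train_dataset=tokenized,
    )
    trainer.train()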

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec
Member

Thanks for fixing it!
I made a few comments. Can you also add a test?

@BenasdTW
Contributor Author

Added test: test_sft_trainer_directly_with_pretokenized_data

root@813ecfcb235b:/workspaces/LLMTrain/trl# python -m pytest tests/test_sft_trainer.py::SFTTrainerTester::test_sft_trainer_directly_with_pretokenized_data
=================================================== test session starts ====================================================
platform linux -- Python 3.11.10, pytest-8.3.4, pluggy-1.5.0
rootdir: /workspaces/LLMTrain/trl
configfile: pyproject.toml
plugins: hypothesis-6.115.5, rerunfailures-15.0, cov-6.0.0, xdist-3.6.1, anyio-4.8.0
collected 1 item                                                                                                           

tests/test_sft_trainer.py .                                                                                          [100%]

==================================================== 1 passed in 18.36s ====================================================

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec (Member) left a comment

Thanks @BenasdTW! This can be merged as soon as the CI is green.

@qgallouedec changed the title from "[SFT] Handles the dataset if it has been preprocessed" to "🍟 [SFT] Handles the dataset if it has been preprocessed" on Feb 17, 2025
@qgallouedec merged commit aafd8cb into huggingface:main on Feb 18, 2025
13 checks passed
qgallouedec added a commit that referenced this pull request Feb 18, 2025
* return dataset if it's preprocessed

* add is_processed flag variable

* add test

* move test_sft_trainer_directly_with_pretokenized_data to Tester2

* Update sft_trainer.py

* no need for padding and truncation

* minor reorganization

* Update trl/trainer/sft_trainer.py

* let the collator pad

* style

* fix tests

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
@hanyin88
Copy link

Thanks for the kind contribution. I'm wondering why

    def tokenize(ex):
        tokenized = processing_class(ex[args.dataset_text_field])
        return {"input_ids": tokenized["input_ids"], "attention_mask": tokenized["attention_mask"]}

    dataset = dataset.map(tokenize, **map_kwargs)

now generates a new dataset fingerprint and can no longer reuse the prior cache. All prior steps could use the cache with the same fingerprint without any problem. Many thanks.
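
Not an answer from the maintainers, but one plausible explanation (an assumption): `datasets` derives the cache fingerprint of `.map()` from a hash of the mapped function, including the objects it closes over. A nested `tokenize` defined inside the trainer captures `processing_class` and `args`, so a refactor that changes how the closure is defined or what it captures yields a different hash, hence a new fingerprint and a cache miss, even if the tokenization itself is unchanged. A small illustration of that hashing behavior (the helper and its arguments are made up for the demo):

    # Illustration only: how datasets hashes a mapped function for fingerprinting.
    from datasets.fingerprint import Hasher

    def make_tokenize(text_field):
        # A closure like the trainer's nested `tokenize`: its hash depends on
        # the captured values and on the function's definition.
        def tokenize(ex):
            return {"n_chars": len(ex[text_field])}
        return tokenize

    # Same body, different captured value -> different hash -> different .map()
    # fingerprint, so the previously cached result is not reused.
    print(Hasher.hash(make_tokenize("text")))
    print(Hasher.hash(make_tokenize("prompt")))

If cache reuse matters, `Dataset.map` also accepts a `new_fingerprint` argument that can pin the fingerprint explicitly.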

yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025
Successfully merging this pull request may close this issue: Cannot run SFTTrainer with tokenized data after updating TRL.