Add split argument to required builders and set it default value to "train" #1783

krammnic · 2024-10-09T12:39:41Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Please link to any issues this PR addresses.
#1770

Changelog

What are the changes made in this PR?

Add a split argument to the required builders and set its default value to "train".
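A minimal sketch of the pattern this change describes, not the actual torchtune code: `toy_load_dataset` and `toy_builder` are hypothetical stand-ins. The stand-in loader assumes behaviour similar to Hugging Face `datasets.load_dataset`, which returns a mapping of all splits when `split` is omitted and a single dataset when it is given.

```python
from typing import Any, Dict, List, Optional, Union

Rows = List[Dict[str, str]]

# Hypothetical in-memory data standing in for a dataset with two splits.
_TOY_DATA: Dict[str, Rows] = {
    "train": [
        {
            "question": "What time is it in London?",
            "answer": "It is 10:00 AM in London.",
        }
    ],
    "test": [],
}


def toy_load_dataset(
    source: str, split: Optional[str] = None, **kwargs: Any
) -> Union[Rows, Dict[str, Rows]]:
    """Stand-in loader: every split as a dict if split is None, else one split."""
    if split is None:
        return dict(_TOY_DATA)
    return _TOY_DATA[split]


def toy_builder(source: str, split: str = "train", **load_dataset_kwargs: Any) -> Rows:
    """Stand-in builder: exposes split explicitly and defaults it to "train"."""
    return toy_load_dataset(source, split=split, **load_dataset_kwargs)


rows = toy_builder("json")                     # no split given, "train" is used
test_rows = toy_builder("json", split="test")  # callers can still override it
```

With the explicit default, a caller who omits `split` gets a single split back rather than a mapping of splits, which is the situation the stand-in loader is meant to illustrate.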

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
add unit tests for any new functionality
update docstrings for any new or updated methods or classes
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
manually run any new or modified recipes with sufficient proof of correctness
include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

I did not change any public API
I have added an example to docs or docstrings

pytorch-bot · 2024-10-09T12:39:46Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1783

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0b7ea86 with merge base 209d55d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-10-09T12:39:47Z

Hi @krammnic!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

krammnic · 2024-10-09T12:45:47Z

@joecummings @RdoubleA Requesting review.

facebook-github-bot · 2024-10-09T12:54:21Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

joecummings

This generally looks good, but can you include some output where you attempt to run on a dataset without specifying train and it starts without any errors?

krammnic · 2024-10-09T13:51:45Z

For multimodal:

from torchtune.datasets.multimodal import multimodal_chat_dataset


ds = multimodal_chat_dataset(
    model_transform=None,
    source="Lin-Chen/ShareGPT4V",
    split="train",
    name="ShareGPT4V",
    image_dir="/home/user/dataset/",
    image_tag="<image>",
)

Output:

README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.22k/2.22k [00:00<00:00, 10.9MB/s]
(…)egpt4v_instruct_gpt4-vision_cap100k.json: 100%|██████████████████████████████████████████████████████████████████████████████| 134M/134M [00:02<00:00, 56.2MB/s]
Generating train split: 100%|████████████████████████████████████████████████████████████████████████████████████| 102025/102025 [00:02<00:00, 49778.44 examples/s]

For instruct:

from torchtune.datasets import instruct_dataset

dataset = instruct_dataset(
    tokenizer=None,
    source="json",
    data_files="my_dataset.json",
    column_map={
        "input": "question",
        "output": "answer",
    },
    train_on_input=False,
    packed=False,
)

my_dataset.json

[
    {
        "question": "What time is it in London?",
        "answer": "It is 10:00 AM in London."
    }
]

Output:

Generating train split: 1 examples [00:00, 122.93 examples/s]
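One way to produce a small `my_dataset.json` like the one used here is to write it with Python's `json` module, which emits strictly valid JSON (no trailing commas). This is only a sketch: the temporary directory is an assumption made for illustration, and the records simply mirror the example above.

```python
import json
import tempfile
from pathlib import Path

records = [
    {
        "question": "What time is it in London?",
        "answer": "It is 10:00 AM in London.",
    }
]

# Write to a temporary directory so the sketch does not touch the working tree.
path = Path(tempfile.mkdtemp()) / "my_dataset.json"
path.write_text(json.dumps(records, indent=4), encoding="utf-8")

# Reading the file back confirms that it parses and keeps both columns.
loaded = json.loads(path.read_text(encoding="utf-8"))
```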

For chat:

from torchtune.datasets import chat_dataset

dataset = chat_dataset(
    tokenizer=None,
    source="json",
    data_files="my_dataset.json",
    conversation_column="conversations",
    conversation_style="sharegpt",
    train_on_input=False,
    packed=False
)

my_dataset.json

[
    {
        "conversations": [
            {
                "from": "human",
                "value": "What time is it in London?"
            },
            {
                "from": "gpt",
                "value": "It is 10:00 AM in London."
            }
        ]
    }
]

Output:

Generating train split: 1 examples [00:00, 111.08 examples/s]

Should this be covered with a test?
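One possible shape for such a test, sketched under assumptions: `default_of` is a hypothetical helper and `toy_builder` stands in for a real builder. The same check could be pointed at the actual builders once they are importable in the test environment.

```python
import inspect
from typing import Any, Callable


def default_of(func: Callable[..., Any], param: str) -> Any:
    """Return the declared default of `param` (inspect.Parameter.empty if none)."""
    return inspect.signature(func).parameters[param].default


def toy_builder(source: str, split: str = "train", **load_dataset_kwargs: Any) -> str:
    """Hypothetical builder that only reports which split it would load."""
    return f"{source}:{split}"


def test_split_defaults_to_train() -> None:
    # The signature itself is checked, so no dataset has to be downloaded.
    assert default_of(toy_builder, "split") == "train"


test_split_defaults_to_train()
```

Checking the signature rather than loading data keeps the test fast and independent of network access.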

krammnic · 2024-10-09T13:58:30Z

@joecummings I have provided examples, but it should definitely be fine, because the same argument was already passed through the **kwargs of SFTDataset.

RdoubleA

This is excellent, thank you so much for adding this!

AsixDO and others added 2 commits October 9, 2024 15:27

Set split=train as default to all datasets

9cc4023

Add split argument in docsting

0b7ea86

krammnic changed the title ~~Add split argument to required builders and set it default value to "train~~ Add split argument to required builders and set it default value to "train" Oct 9, 2024

facebook-github-bot added the CLA Signed label Oct 9, 2024

joecummings reviewed Oct 9, 2024

RdoubleA approved these changes Oct 9, 2024

RdoubleA merged commit a04588e into pytorch:main Oct 9, 2024
17 checks passed

mori360 pushed a commit to mori360/torchtune that referenced this pull request Oct 14, 2024

Add split argument to required builders and set it default value to "…

ee3e703

…train" (pytorch#1783) Co-authored-by: Mark Obozov <obozovmark9@gmail.com>
