Documentation dataset format #2020
Conversation
Really awesome docs @qgallouedec - thanks for bringing some order to the chaos of dataset formats ❤️ !
Everything LGTM, with one main question about what is meant by "standard dataset format". What I'm wondering in particular is whether we expect users to preformat their datasets for each trainer, or whether we accept some formats, like `messages` as a column, that are automatically formatted in our scripts.
Apart from this, feel free to merge with the nits!
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Yes... "standard" is in fact "non-conversational". But it's weird to define something by something it's not.
Currently we expect users to preformat their datasets. But I think supporting everything in the trainers can make sense:

```python
from datasets import Dataset

standard_dataset = Dataset.from_dict(
    {
        "prompt": ["The sky is", "The sun is"],
        "completion": [" blue.", " in the sky."],
    }
)
AnyTrlTrainer(..., train_dataset=standard_dataset)  # ok

conversational_dataset = Dataset.from_dict(
    {
        "prompt": [
            [{"role": "user", "content": "What color is the sky?"}],
            [{"role": "user", "content": "Where is the sun?"}],
        ],
        "completion": [
            [{"role": "assistant", "content": "It is blue."}],
            [{"role": "assistant", "content": "In the sky."}],
        ],
    }
)
AnyTrlTrainer(..., train_dataset=conversational_dataset)  # currently not ok, but we can support it in the future
```
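A trainer that accepted both formats would first need to tell them apart. A minimal sketch of such a check, using a heuristic for illustration only (this is not TRL's actual detection logic):

```python
def is_conversational(example: dict) -> bool:
    """Heuristic: a column is conversational if it holds a list of
    {"role": ..., "content": ...} messages rather than a plain string."""
    for value in example.values():
        if (
            isinstance(value, list)
            and value
            and isinstance(value[0], dict)
            and {"role", "content"} <= value[0].keys()
        ):
            return True
    return False

standard = {"prompt": "The sky is", "completion": " blue."}
conversational = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
}
```

A trainer could run this on the first example of `train_dataset` and branch into the appropriate preprocessing path.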
Ah, that makes sense - thanks for the clarification. One thing I'm wondering is whether there's any real advantage to supporting standard vs conversational, since technically all standard datasets can be converted to conversational by wrapping the prompt, completion, etc. in the messages format. Given that most models are chat models, one approach would be to support conversational datasets natively in the trainers, but allow users to also provide a preprocessed / tokenized dataset if they wish more flexibility. This is also related to #1646
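The wrapping described above is mechanical. A minimal sketch, assuming plain string columns named `prompt`, `completion`, `chosen`, or `rejected` (the `wrap_as_conversational` helper is hypothetical, not a TRL function):

```python
def wrap_as_conversational(example: dict) -> dict:
    """Wrap plain-text fields into the messages format: the prompt becomes
    a user message, completions become assistant messages."""
    wrapped = {}
    if "prompt" in example:
        wrapped["prompt"] = [{"role": "user", "content": example["prompt"]}]
    for key in ("completion", "chosen", "rejected"):
        if key in example:
            wrapped[key] = [{"role": "assistant", "content": example[key]}]
    return wrapped

row = {"prompt": "The sky is", "completion": " blue."}
```

Applied with `Dataset.map`, this would turn any standard dataset into a conversational one, which is why supporting only the conversational path natively is attractive.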
* first piece of doc
* improve readability
* some data utils and doc
* simplify prompt-only
* format
* fix path data utils
* fix example format
* simplify
* tests
* prompt-completion
* update anthropic hh
* update dataset script
* implicit prompt
* additional content
* `maybe_reformat_dpo_to_kto` -> `unpair_preference_dataset`
* Preference dataset with implicit prompt
* unpair preference dataset tests
* documentation
* ...
* doc
* changes applied to dpo example
* better doc and better log error
* a bit more doc
* improve doc
* converting
* some subsections
* converting section
* further refinements
* tldr
* tldr preference
* rename
* lm-human-preferences-sentiment
* `imdb` to `stanfordnlp/imdb`
* Add script for LM human preferences descriptiveness
* Remove sentiment_descriptiveness.py script
* style
* example judge tldr with new dataset
* Style
* Dataset conversion for TRL compatibility
* further refinements
* trainers in doc
* top level for functions
* stanfordnlp/imdb
* downgrade transformers
* temp reduction of tests
* next commit
* next commit
* additional content
* proper tick format
* precise the assistant start token
* improve
* lower case
* Update titles in _toctree.yml and data_utils.mdx
* revert make change
* correct dataset ids
* expand a bit dataset formats
* skip gated repo tests
* data utilities in API
* Update docs/source/dataset_formats.mdx (Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>)
* Update docs/source/dataset_formats.mdx (Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>)
* Update docs/source/dataset_formats.mdx (Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>)
* Update docs/source/dataset_formats.mdx (Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>)
* tiny internal testing for chat template testing
* precise type/format
* exclude sft trainer in doc
* Update trl/trainer/utils.py
* XPO in the doc

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
What does this PR do?
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Help for review
New section in the doc
New data utils
`maybe_apply_chat_template`, `maybe_extract_prompt`, `maybe_unpair_preference_dataset`, `apply_chat_template`, `extract_prompt` and `unpair_preference_dataset`.
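For intuition, here is a rough sketch over plain dicts of the transformation `unpair_preference_dataset` performs: each paired preference row with `chosen`/`rejected` becomes two unpaired rows with a `completion` and a boolean `label`. This is an illustration of the idea, not TRL's implementation:

```python
def unpair_preferences(rows):
    """Turn paired preference rows (prompt/chosen/rejected) into
    unpaired rows (prompt/completion/label), two per input row."""
    unpaired = []
    for row in rows:
        unpaired.append({"prompt": row["prompt"], "completion": row["chosen"], "label": True})
        unpaired.append({"prompt": row["prompt"], "completion": row["rejected"], "label": False})
    return unpaired

paired = [{"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}]
```

The unpaired form is what KTO-style trainers consume, which is why the utility (and its `maybe_` variant that no-ops on already-unpaired data) is useful.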
Update dataset example files
As explained in this section, we provide scripts for converting datasets to TRL style. The goal is, for every dataset in trl-lib, to have its corresponding script in examples/datasets, and for examples/datasets to closely match the documented format.

Start to update some example scripts
We have to update all scripts to make sure they comply with the new style. I prefer dedicating a further PR to updating them all.

Update `SIMPLE_QUERY_CHAT_TEMPLATE` to include a prompt generation
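For readers unfamiliar with the term, "including a prompt generation" means the rendered template ends with the assistant start token, so the model continues as the assistant. The real `SIMPLE_QUERY_CHAT_TEMPLATE` is a Jinja string applied by the tokenizer; the function below is only a pure-Python illustration of the idea, with made-up `<|role|>` markers:

```python
def render_chat(messages, add_generation_prompt=False):
    """Render messages in a toy chat markup; optionally append the
    assistant start token so generation continues as the assistant."""
    text = ""
    for message in messages:
        text += f"<|{message['role']}|>\n{message['content']}\n"
    if add_generation_prompt:
        text += "<|assistant|>\n"
    return text

prompt = [{"role": "user", "content": "What color is the sky?"}]
```

With `add_generation_prompt=True`, the output is a proper generation prompt rather than a bare transcript, which is what the template change above is about.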