feat: add OpenAI format dataset for SFT #485

AtsunoriFujita · 2025-06-05T07:09:41Z

What does this PR do ?

This PR enables using the OpenAI format dataset from a json/jsonl when running SFT.

Issues

List issues that this PR closes (syntax):

Usage

Modify examples/configs/sft.yaml

 data:
  max_input_seq_length: ${policy.max_total_sequence_length}
  dataset_name: "openai_format"
  add_bos: true
  add_eos: true
  add_generation_prompt: false
  train_data_path: "datasets/train.json"
  val_data_path: "datasets/valid.json"
  chat_key: "messages"  # set chat_key like "messages" and "conversations"
  system_key: null  # (optional) set system_key (system_prompt key name) if dataset has
  system_prompt: null  # (optional) set system_prompt, e.g., "You are a helpful assistant." If system_key is not null, it will be prioritized.

Run SFT job

uv run examples/run_sft.py --config examples/configs/sft.yaml

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

ashors1

Thanks for the PR! Could we add some documentation on this class and how it differs from prompt_response_dataset.py? Do you think it would make sense to consider merging this class with PromptResponseDataset?

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

AtsunoriFujita · 2025-06-18T11:10:15Z

Hi @ashors1, added docstrings and unit tests.

Could we add some documentation on this class and how it differs from prompt_response_dataset.py?

The openai_format deals with a single key (like messages or conversations) and single or multiple turns (user, assistant), similar to HuggingFaceTB/smoltalk.

ashors1

Looks good! Thank you for the contribution!

terrykong · 2025-06-18T22:04:25Z

@AtsunoriFujita could you run pre-commit on your change? It fails our linter job

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

AtsunoriFujita · 2025-06-18T23:53:30Z

Hi @terrykong, applied pre-commit.

terrykong · 2025-06-18T23:56:24Z

@AtsunoriFujita do you mind putting an example run command in the description so that users finding this PR can learn how to use this?

AtsunoriFujita · 2025-06-19T01:44:39Z

@terrykong, thank you. I added it.

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

Add OpenAI format dataset for SFT

236bbc7

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

terrykong requested a review from ashors1 June 5, 2025 16:28

ashors1 reviewed Jun 17, 2025

View reviewed changes

Add class docstrings and unit test

b5343be

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

ashors1 previously approved these changes Jun 18, 2025

View reviewed changes

terrykong changed the title ~~Add OpenAI format dataset for SFT~~ feat: add OpenAI format dataset for SFT Jun 18, 2025

Apply pre-commit

8a966bc

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

AtsunoriFujita dismissed ashors1’s stale review via 8a966bc June 18, 2025 23:52

terrykong approved these changes Jun 18, 2025

View reviewed changes

parthchadha approved these changes Jul 2, 2025

View reviewed changes

parthchadha added this pull request to the merge queue Jul 2, 2025

Merged via the queue into NVIDIA-NeMo:main with commit d3c58f0 Jul 2, 2025
14 of 15 checks passed

xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request Jul 2, 2025

feat: add OpenAI format dataset for SFT (NVIDIA-NeMo#485)

ddac07c

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

AtsunoriFujita deleted the afujita/add_openai_format_sft branch July 3, 2025 00:10

therealnaveenkamal pushed a commit to therealnaveenkamal/RL that referenced this pull request Jul 7, 2025

feat: add OpenAI format dataset for SFT (NVIDIA-NeMo#485)

ac2e4e6

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request Jul 14, 2025

feat: add OpenAI format dataset for SFT (NVIDIA-NeMo#485)

e810c24

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025

feat: add OpenAI format dataset for SFT (#485)

100a7f5

Signed-off-by: Atsunori Fujita <afujita@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add OpenAI format dataset for SFT #485

feat: add OpenAI format dataset for SFT #485

Uh oh!

AtsunoriFujita commented Jun 5, 2025 •

edited

Loading

Uh oh!

ashors1 left a comment

Uh oh!

AtsunoriFujita commented Jun 18, 2025

Uh oh!

ashors1 left a comment

Uh oh!

terrykong commented Jun 18, 2025

Uh oh!

AtsunoriFujita commented Jun 18, 2025

Uh oh!

terrykong commented Jun 18, 2025

Uh oh!

AtsunoriFujita commented Jun 19, 2025

Uh oh!

Uh oh!

Uh oh!

feat: add OpenAI format dataset for SFT #485

feat: add OpenAI format dataset for SFT #485

Uh oh!

Conversation

AtsunoriFujita commented Jun 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Uh oh!

ashors1 left a comment

Choose a reason for hiding this comment

Uh oh!

AtsunoriFujita commented Jun 18, 2025

Uh oh!

ashors1 left a comment

Choose a reason for hiding this comment

Uh oh!

terrykong commented Jun 18, 2025

Uh oh!

AtsunoriFujita commented Jun 18, 2025

Uh oh!

terrykong commented Jun 18, 2025

Uh oh!

AtsunoriFujita commented Jun 19, 2025

Uh oh!

Uh oh!

Uh oh!

AtsunoriFujita commented Jun 5, 2025 •

edited

Loading