
Conversation

@LeonEricsson (Collaborator) commented Apr 21, 2025

Fixes #2998, #3333

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LeonEricsson marked this pull request as ready for review April 21, 2025 19:07
@qgallouedec (Member) left a comment


Just some minor things, and I think we're mostly good!

@qgallouedec (Member) left a comment


lgtm! thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec changed the title from "[GRPO] Make training dataset shuffle optional" to "🎲 [GRPO] Make training dataset shuffle optional" on Apr 21, 2025
@qgallouedec merged commit 0dad4eb into huggingface:main Apr 21, 2025
9 checks passed
@Maghoumi

@LeonEricsson

Does this actually work? Maybe there's a trick to getting this to work that I'm not aware of?

On my end, I sort my training dataset based on some criteria, then I set the shuffle_dataset=False when instantiating the GRPOTrainer. Also, I confirm that the dataset that is being passed to GRPOTrainer's constructor still has the order I sorted in.

But when the rewards are calculated, I see that the order of the inputs passed to my reward function is random, as if the dataset had been shuffled.

@LeonEricsson (Collaborator, Author) commented Aug 21, 2025

> Does this actually work? Maybe there's a trick to getting this to work that I'm not aware of?
>
> On my end, I sort my training dataset based on some criteria, then I set the shuffle_dataset=False when instantiating the GRPOTrainer. Also, I confirm that the dataset that is being passed to GRPOTrainer's constructor still has the order I sorted in.
>
> But when the rewards are calculated, I see that the order of the inputs passed to my reward function is random as if it's shuffled.

Created a quick sanity script, and things worked fine on my end: the prompts arrived at the reward function in the expected sorted order.

import random

from datasets import load_dataset

from trl import GRPOConfig, GRPOTrainer


# Load and sort dataset by a predictable field
dataset = load_dataset("trl-lib/tldr", split="train[:10]")

# Sort by prompt length
prompt_lengths = [len(item["prompt"]) for item in dataset]
dataset = dataset.add_column("prompt_length", prompt_lengths)
dataset = dataset.sort("prompt_length", reverse=False)


def reward_random(prompts, completions, **kwargs):
    # With shuffle_dataset=False, these lengths should print in
    # non-decreasing order across training steps.
    for p in prompts:
        print(len(p))
    return [random.random() for _ in range(len(completions))]


trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_random,
    train_dataset=dataset,
    args=GRPOConfig(
        per_device_train_batch_size=2,
        num_generations=2,
        max_completion_length=128,
        max_steps=10,
        report_to="none",
        shuffle_dataset=False,
    ),
)

trainer.train()

@Maghoumi

Thanks for your response and sharing your minimal code to reproduce. Your code helped me in two different ways:

  1. I was passing shuffle_dataset to GRPOTrainer (instead of passing it to GRPOConfig). Duh! I confirm it's working as expected once passed correctly.
  2. I detected a strange behavior in unsloth: if you run your code on a machine with multiple GPUs (and all the GPUs are visible to the code), the behavior of GRPOTrainer's sampler differs from running with only a single GPU (or with only a single GPU visible via CUDA_VISIBLE_DEVICES). This is despite the fact that unsloth only uses a single GPU for training. So, somehow the number of visible GPUs affects the batching logic. I will follow up with them on this.

Thanks again for your help.
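For anyone else who hits the same mix-up: shuffle_dataset is a GRPOConfig field, not a GRPOTrainer constructor argument. A minimal sketch of the difference, using placeholder classes rather than the real trl ones:

```python
from dataclasses import dataclass


# Placeholder stand-ins for GRPOConfig / GRPOTrainer, only to show where
# the flag belongs; the real trl classes take many more arguments.
@dataclass
class Config:
    shuffle_dataset: bool = True


class Trainer:
    def __init__(self, args):
        self.args = args


# Correct: the flag is a config field.
trainer = Trainer(args=Config(shuffle_dataset=False))
print(trainer.args.shuffle_dataset)  # False

# Wrong: the trainer constructor has no such keyword.
try:
    Trainer(args=Config(), shuffle_dataset=False)
except TypeError:
    print("TypeError: unexpected keyword argument")
```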

@qgallouedec (Member)

I don't think it's unsloth but transformers' Trainer. When you have multiple GPUs visible, you must restrict them with CUDA_VISIBLE_DEVICES.
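A sketch of that workaround, assuming a training script named train_grpo.py (the script name and GPU index are placeholders):

```shell
# Expose only GPU 0 to the process so transformers' Trainer sees a
# single device and batches as in the single-GPU case.
CUDA_VISIBLE_DEVICES=0 python train_grpo.py
```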

@LeonEricsson (Collaborator, Author)

> I don't think it's unsloth but transformers' Trainer. When you have multiple GPUs visible, you must restrict them with CUDA_VISIBLE_DEVICES.

Yeah, I've experienced this as well.

Successfully merging this pull request may close these issues:

  • How can I set the dataset to not shuffle? It seems there is no such option.
  • GRPOTrainer does not have a feature flag to prevent dataset shuffling