
Conversation

@LeonEricsson (Collaborator) commented Apr 21, 2025

Fixes #2998, #3333

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LeonEricsson marked this pull request as ready for review April 21, 2025 19:07
@qgallouedec (Member) left a comment


Just some minor things, and I think we're mostly good!

@qgallouedec (Member) left a comment


lgtm! thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec changed the title from "[GRPO] Make training dataset shuffle optional" to "🎲 [GRPO] Make training dataset shuffle optional" on Apr 21, 2025
@qgallouedec merged commit 0dad4eb into huggingface:main Apr 21, 2025
9 checks passed
@Maghoumi

@LeonEricsson

Does this actually work? Maybe there's a trick to getting this to work that I'm not aware of?

On my end, I sort my training dataset based on some criteria, then I set the shuffle_dataset=False when instantiating the GRPOTrainer. Also, I confirm that the dataset that is being passed to GRPOTrainer's constructor still has the order I sorted in.

But when the rewards are calculated, I see that the order of the inputs passed to my reward function is random, as if the dataset had been shuffled.

@LeonEricsson (Collaborator, Author) commented Aug 21, 2025

> Does this actually work? Maybe there's a trick to getting this to work that I'm not aware of?
>
> On my end, I sort my training dataset based on some criteria, then I set the shuffle_dataset=False when instantiating the GRPOTrainer. Also, I confirm that the dataset that is being passed to GRPOTrainer's constructor still has the order I sorted in.
>
> But when the rewards are calculated, I see that the order of the inputs passed to my reward function is random as if it's shuffled.

Created a quick sanity script, and things worked fine on my end: the prompts arrived at the reward function in the expected sorted order.

import random

from datasets import load_dataset

from trl import GRPOConfig, GRPOTrainer


# Load and sort dataset by a predictable field
dataset = load_dataset("trl-lib/tldr", split="train[:10]")

# Sort by prompt length
prompt_lengths = [len(item["prompt"]) for item in dataset]
dataset = dataset.add_column("prompt_length", prompt_lengths)
dataset = dataset.sort("prompt_length", reverse=False)


def reward_random(prompts, completions, **kwargs):
    # With shuffle_dataset=False, these lengths should print in
    # non-decreasing order across training steps.
    for p in prompts:
        print(len(p))
    return [random.random() for _ in range(len(completions))]


trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_random,
    train_dataset=dataset,
    args=GRPOConfig(
        per_device_train_batch_size=2,
        num_generations=2,
        max_completion_length=128,
        max_steps=10,
        report_to="none",
        shuffle_dataset=False,
    ),
)

trainer.train()

@Maghoumi

Thanks for your response and sharing your minimal code to reproduce. Your code helped me in two different ways:

  1. I was passing shuffle_dataset to GRPOTrainer (instead of passing it to GRPOConfig). Duh! I confirm it's working as expected once passed correctly.
  2. I detected a strange behavior in unsloth: if you run your code on a machine with multiple GPUs (and all the GPUs are visible to the code), the behavior of GRPOTrainer's sampler differs from running with only a single GPU (or with only a single GPU visible via CUDA_VISIBLE_DEVICES). This is despite the fact that unsloth only uses a single GPU for training. So, somehow the number of visible GPUs affects the batching logic. I will follow up with them on this.

Thanks again for your help.
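For anyone else who hits the same mix-up: shuffle_dataset is a GRPOConfig field, not a GRPOTrainer constructor argument. A minimal sketch of the difference, using placeholder classes rather than the real trl ones:

```python
from dataclasses import dataclass


# Placeholder stand-ins for GRPOConfig / GRPOTrainer, only to show where
# the flag belongs; the real trl classes take many more arguments.
@dataclass
class Config:
    shuffle_dataset: bool = True


class Trainer:
    def __init__(self, args):
        self.args = args


# Correct: the flag is a config field.
trainer = Trainer(args=Config(shuffle_dataset=False))
print(trainer.args.shuffle_dataset)  # False

# Wrong: the trainer constructor has no such keyword.
try:
    Trainer(args=Config(), shuffle_dataset=False)
except TypeError:
    print("TypeError: unexpected keyword argument")
```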

@qgallouedec (Member)

I don't think it's unsloth but transformers' Trainer. When you have multiple GPUs visible, you must restrict them with CUDA_VISIBLE_DEVICES.
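A sketch of that workaround, assuming a training script named train_grpo.py (the script name and GPU index are placeholders):

```shell
# Expose only GPU 0 to the process so transformers' Trainer sees a
# single device and batches as in the single-GPU case.
CUDA_VISIBLE_DEVICES=0 python train_grpo.py
```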

@LeonEricsson (Collaborator, Author)

> I don't think it's unsloth but transformers' Trainer. When you have multiple GPUs visible, you must restrict them with CUDA_VISIBLE_DEVICES.

Yeah, I've experienced this as well.

Successfully merging this pull request may close these issues:

  • How can I set the dataset to not shuffle? It seems there is no such option.
  • GRPOTrainer does not have a feature flag to prevent dataset shuffling