
🔧 Fix GRPO sampling logic #3725


Merged: 7 commits merged from fix-grpo-sampling into main on Jul 15, 2025
Conversation

@qgallouedec (Member) commented Jul 12, 2025:

No idea how we ended up with this erroneous sampling logic, but after debugging in depth, I realized that GRPO is incorrect in several places. This PR fixes those problems.

EDIT: I was initially wrong, but there are still a couple of errors to fix

If it can help the review, here is the diagram I used: [image: Untitled-2025-07-08-1423]

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec changed the title from "Fix GRPO sampling logic" to "🔧 Fix GRPO sampling logic" on Jul 12, 2025
Review comment on the GRPOConfig docstring:

        `per_device_train_batch_size * num_processes * steps_per_generation`. In other words, there is one
        generation batch processed per optimization step. Mutually exclusive with `steps_per_generation`.
    steps_per_generation: (`int` or `None`, *optional*, defaults to `None`):
        Number of optimization steps per generation. If `None`, it defaults to `gradient_accumulation_steps`.
@qgallouedec (Member Author) suggested a fix for a typo in the parameter name:

-         steps_per_generations: (`int` or `None`, *optional*, defaults to `None`):
+         steps_per_generation: (`int` or `None`, *optional*, defaults to `None`):

Review comment on the `steps_per_generation` docstring:

        `per_device_train_batch_size * num_processes * steps_per_generation`. In other words, there is one
        generation batch processed per optimization step. Mutually exclusive with `steps_per_generation`.
    steps_per_generation: (`int` or `None`, *optional*, defaults to `None`):
        Number of steps per generation. If `None`, it defaults to `gradient_accumulation_steps`. Mutually exclusive

@qgallouedec (Member Author):

It's not the number of optimization steps, but the number of steps.
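
For context, a toy calculation of that distinction, with made-up numbers (this is my reading of the docstring, not part of the PR):

    # Made-up numbers, only to illustrate "steps" vs. "optimization steps".
    gradient_accumulation_steps = 4  # micro-batch steps accumulated per optimizer update
    steps_per_generation = 4         # steps between two generation rounds

    # With the default steps_per_generation == gradient_accumulation_steps, there is
    # exactly one generation batch per optimization step, which is what the docstring's
    # "one generation batch processed per optimization step" sentence describes.
    assert steps_per_generation / gradient_accumulation_steps == 1.0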

    # Check if the effective batch size can be divided by the number of generations
    if self.num_generations < 2:
        raise ValueError(
            "GRPO requires at least 2 generations per prompt to calculate the advantages. You provided "
            f"{self.num_generations}, which is less than the minimum required."
        )
    possible_values = [
@qgallouedec (Member Author) commented Jul 12, 2025:

The checks below aren't relevant anymore. The only check we need is:

if self.generation_batch_size % (self.per_device_train_batch_size * num_processes) != 0:
    raise ValueError(
        f"generation_batch_size ({self.generation_batch_size}) must be divisible by the global batch size "
        f"({self.per_device_train_batch_size * num_processes})."
    )
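
A quick sanity check of that constraint with illustrative values (the numbers below are assumptions, not from the PR):

    # Illustrative values, not from the PR.
    per_device_train_batch_size = 4
    num_processes = 2                  # e.g. two GPUs
    generation_batch_size = 24         # must be a multiple of 4 * 2 = 8

    global_batch_size = per_device_train_batch_size * num_processes
    if generation_batch_size % global_batch_size != 0:
        raise ValueError(
            f"generation_batch_size ({generation_batch_size}) must be divisible by "
            f"the global batch size ({global_batch_size})."
        )
    # 24 % 8 == 0, so this passes; generation_batch_size = 20 would raise.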

@@ -179,7 +179,7 @@ def __iter__(self):
                 yield index

     def __len__(self) -> int:
-        return self.num_samples * self.mini_repeat_count * self.repeat_count
+        return (self.num_samples // self.batch_size) * self.batch_size * self.mini_repeat_count * self.repeat_count
@qgallouedec (Member Author) commented Jul 12, 2025:

Because of this (just above):

        #    [[2, 4, 3], [1, 0, 6], [5]]
        # -> [[2, 4, 3], [1, 0, 6]]
        indexes = [chunk for chunk in indexes if len(chunk) == self.batch_size]
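
A minimal standalone sketch of that chunk-dropping behavior and the matching `__len__` arithmetic (my own reconstruction, not the trainer code):

    # Indexes are grouped into batches and any trailing incomplete batch is
    # dropped, so __len__ must round num_samples down to a multiple of batch_size.
    num_samples, batch_size = 7, 3
    indexes = list(range(num_samples))  # [0, 1, 2, 3, 4, 5, 6]
    chunks = [indexes[i : i + batch_size] for i in range(0, num_samples, batch_size)]
    chunks = [chunk for chunk in chunks if len(chunk) == batch_size]
    # -> [[0, 1, 2], [3, 4, 5]]; index 6 is dropped
    assert sum(len(c) for c in chunks) == (num_samples // batch_size) * batch_size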

Comment on lines +620 to +625

        # Keep logs sized to the generation batch to record only outputs from the latest model update.
        self._textual_logs = {
-            "prompt": deque(maxlen=maxlen),
-            "completion": deque(maxlen=maxlen),
-            "rewards": defaultdict(lambda: deque(maxlen=maxlen)),
-            "advantages": deque(maxlen=maxlen),
+            "prompt": deque(maxlen=args.generation_batch_size),
+            "completion": deque(maxlen=args.generation_batch_size),
+            "rewards": defaultdict(lambda: deque(maxlen=args.generation_batch_size)),
+            "advantages": deque(maxlen=args.generation_batch_size),
@qgallouedec (Member Author):

Equivalent, but simplified by using `generation_batch_size` directly.
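
For reference, a `deque` with `maxlen` evicts its oldest entries on append, which is what scopes the logs to the latest generation batch; a tiny sketch with an assumed maxlen of 3:

    from collections import deque

    # Appending beyond maxlen evicts from the left, so only the most recent
    # generation_batch_size entries survive in the textual logs.
    logs = deque(maxlen=3)
    for prompt in ["p0", "p1", "p2", "p3"]:
        logs.append(prompt)
    assert list(logs) == ["p1", "p2", "p3"]  # "p0" was evicted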

@@ -819,7 +817,7 @@ def _get_train_sampler(self, dataset: Optional[Dataset] = None) -> Sampler:
             data_source=dataset,
             mini_repeat_count=self.num_generations,
             batch_size=self.args.generation_batch_size // self.num_generations,
-            repeat_count=self.num_iterations * self.args.steps_per_generation,
+            repeat_count=self.num_iterations,
@qgallouedec (Member Author):

I think this is a big one. No idea why we did it in the first place; there is no need to repeat the sampling `self.args.steps_per_generation` times.
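
To see why the extra factor over-sampled, here is a toy version of the repeat pattern (a sketch under my reading of `RepeatSampler`, not the actual implementation):

    # Each batch of unique prompt indexes is yielded mini_repeat_count times per
    # pass, and the whole pass is repeated repeat_count times.
    def repeat_pattern(indexes, mini_repeat_count, repeat_count):
        out = []
        for _ in range(repeat_count):
            for idx in indexes:
                out.extend([idx] * mini_repeat_count)
        return out

    batch = [0, 1]  # prompt indexes for one generation batch
    # After the fix (repeat_count = num_iterations, here 1):
    assert repeat_pattern(batch, mini_repeat_count=2, repeat_count=1) == [0, 0, 1, 1]
    # Before the fix, repeat_count also carried steps_per_generation (say 2),
    # yielding every generation batch steps_per_generation times too often:
    assert repeat_pattern(batch, 2, 2) == [0, 0, 1, 1, 0, 0, 1, 1]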

A reviewer replied:

I think the original logic might be correct, because in `_prepare_inputs`, data is only sampled once every `generate_every` steps. For the remaining `generate_every - 1` steps, we still need to load data. Therefore, multiplying by `self.args.steps_per_generation` here actually helps to avoid skipping any data.

Reply (Collaborator):

This was what I was thinking when I made the change, although perhaps there were some details of the sampling that I misunderstood, which introduced the bug.

Comment on lines +1275 to +1281

        # If the generation and optimization steps are misaligned—i.e., if generation does not occur at the end of
        # a full optimizer step (when gradient_accumulation_steps is not a multiple of generate_every)—then the
        # samples may come from an earlier version of the model. In that case, we need to track old_per_token_logps
        # for importance sampling. If the steps are aligned, importance sampling isn't necessary and we set
        # old_per_token_logps to None.
        generate_every = self.args.steps_per_generation * self.num_iterations  # generation frequency
        if self.args.gradient_accumulation_steps % generate_every != 0:
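
A worked example of the alignment check with assumed hyperparameters (the values are mine, not from the PR):

    # Assumed hyperparameters, for illustration only.
    steps_per_generation = 2
    num_iterations = 2
    gradient_accumulation_steps = 3

    generate_every = steps_per_generation * num_iterations  # generate every 4 steps
    # Optimizer steps complete every 3 steps but generation happens every 4, so a
    # generation batch can be consumed across two different model versions:
    needs_old_logps = gradient_accumulation_steps % generate_every != 0
    assert needs_old_logps  # True here; False with gradient_accumulation_steps = 4 or 8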
@qgallouedec (Member Author):
Hoping the comment is enough for review

@tangyd commented Jul 18, 2025:

I'm still confused. The `generate_every` is for fixing distribution shift, and I totally understand why it's multiplied by `steps_per_generation`. But when generating new samples, we have `batch_size=self.args.generation_batch_size // self.num_generations` and `repeat_count=self.num_iterations * self.args.steps_per_generation`. According to the documentation, `generation_batch_size = per_device_train_batch_size * num_processes * steps_per_generation`. It seems both `RepeatSampler` parameters, `batch_size` and `repeat_count`, include `steps_per_generation`.
Is this what we really want? If yes, could you give a more detailed explanation in a comment? @qgallouedec
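
To make the quantities in that question concrete (illustrative numbers and the documented relation only; not an answer from the maintainers):

    # generation_batch_size = per_device_train_batch_size * num_processes * steps_per_generation
    per_device_train_batch_size = 4
    num_processes = 2
    steps_per_generation = 4
    num_generations = 8
    num_iterations = 2

    generation_batch_size = per_device_train_batch_size * num_processes * steps_per_generation  # 32
    batch_size = generation_batch_size // num_generations           # 32 // 8 = 4 unique prompts
    repeat_count_before_pr = num_iterations * steps_per_generation  # 8, what the question quotes
    repeat_count_after_pr = num_iterations                          # 2, after this PR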

@edbeeching (Collaborator) commented:

Would it be possible to write an integration test with dummy data to ensure this is now working as expected in a few different configurations?

kashif added a commit to CompN3rd/trl that referenced this pull request Jul 14, 2025
@kashif (Collaborator) commented Jul 14, 2025:

I have been using:

    def test_repeat_sampler_length_calculation(self):
        dataset = Dataset.from_list([{"text": f"sample_{i}"} for i in range(10)])

        sampler = RepeatSampler(
            data_source=dataset,
            mini_repeat_count=2,
            batch_size=3,  # 10 samples / 3 batch_size = 3 complete batches, 1 sample dropped
            repeat_count=1,
            shuffle=False,
        )

        actual_samples = list(sampler)

        # With batch_size=3 and 10 samples, only 9 samples form complete batches (1 dropped)
        # So we expect 9 * 2 (mini_repeat_count) * 1 (repeat_count) = 18 samples
        expected_actual_samples = 18

        self.assertEqual(len(actual_samples), expected_actual_samples)
        self.assertEqual(
            len(sampler),
            len(actual_samples),
            f"RepeatSampler.__len__() should match actual iterations. "
            f"Expected {len(actual_samples)}, but got {len(sampler)}",
        )

    def test_grpo_sampling_no_steps_per_generation_multiplier(self):
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")

        with tempfile.TemporaryDirectory() as tmp_dir:
            config = GRPOConfig(
                output_dir=tmp_dir,
                per_device_train_batch_size=4,
                num_generations=8,
                num_iterations=2,
                steps_per_generation=4,  # Use steps_per_generation without generation_batch_size
                max_steps=1,  # Just test sampler creation
                report_to="none",
            )

            trainer = GRPOTrainer(
                model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
                reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
                args=config,
                train_dataset=dataset,
            )

            # Get the train sampler
            sampler = trainer._get_train_sampler()

            # repeat_count = num_iterations, NOT num_iterations * steps_per_generation
            self.assertEqual(
                sampler.repeat_count,
                config.num_iterations,
                f"RepeatSampler.repeat_count should not include steps_per_generation multiplier. "
                f"Expected {config.num_iterations}, but got {sampler.repeat_count}",
            )

@qgallouedec qgallouedec merged commit 508d551 into main Jul 15, 2025
10 of 11 checks passed
@qgallouedec qgallouedec deleted the fix-grpo-sampling branch July 15, 2025 20:39
@konstantinjdobler commented:

Hey, after reading through this, I am not entirely clear on which cases were bugged before this PR. Could you give a quick overview which settings caused wrong sampling logic before this PR landed?

marcandrelarochelle pushed a commit to marcandrelarochelle/trl that referenced this pull request Jul 29, 2025