💔 [GRPO] Decouple gradient accumulation from the number of minibatches generated #3388
Conversation
LGTM with some nits!
As discussed offline, it would be interesting to check that the …
trl/trainer/grpo_config.py (outdated)

```diff
@@ -61,6 +61,8 @@ class GRPOConfig(TrainingArguments):
         with vLLM generation.
     shuffle_dataset (`bool`, *optional*, defaults to `True`):
         Whether to shuffle the training dataset.
+    num_mini_batches (`int`, *optional*, defaults to `None`):
```
I would find it clearer to call this parameter `generate_every` or something similar. I can't figure out what a mini-batch is in this case. Named like that, and with the doc ("split"), I expect there to be a division by that number later, like `mini_batch_size = batch_size // num_mini_batches`.
For context, the mini-batch terminology used here is the same as in e.g. DAPO.
In other words, @edbeeching is partitioning the effective / generation batch into `num_mini_batches` subsets to compute the loss. It takes the model slightly off-policy, but allows one to generate a large batch and optimise smaller chunks without going OOM.
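To make the partitioning concrete, here is a minimal illustrative sketch (not TRL's actual implementation; the helper name and tensor shapes are made up for the example):

```python
import torch

def iter_mini_batches(generation_batch: torch.Tensor, num_mini_batches: int):
    # Partition the generation batch into `num_mini_batches` equal slices;
    # the idea is that one optimisation step is taken per slice.
    mini_batch_size = generation_batch.shape[0] // num_mini_batches
    for i in range(num_mini_batches):
        yield generation_batch[i * mini_batch_size : (i + 1) * mini_batch_size]

# e.g. DAPO's generation batch of 512 prompts x 16 completions = 8192 samples
batch = torch.randn(8192, 4)
for mini_batch in iter_mini_batches(batch, num_mini_batches=16):
    assert mini_batch.shape[0] == 512  # 16 slices of 512 samples each
```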
I see. Although this is technically correct and corresponds to the paper, I have the impression that it's a bit misleading: as a user, I imagine that by increasing `num_mini_batches`, I'm consequently decreasing the size of each mini-batch and increasing the number of gradient updates per rollout step.
The sentence in the paper is also constructed like this: "the mini-batch size is set to 512, i.e., 16 gradient updates for each rollout step."
But that's not really what happens here. If you don't find it misleading, though, we can leave it like that and just document it better.
To clarify the DAPO example above, they have 512 unique prompts per batch and then sample 16 completions per prompt to obtain a generation batch size of 512 x 16 = 8,192.
They then partition the generation batch into mini-batches of size 512, hence the 16 gradient updates per rollout step.
Now suppose I have 8 GPUs and `per_device_train_batch_size=16`; then I believe the corresponding setting in `trl` would be:
- `gradient_accumulation_steps = 64` (to get 8,192 = 8 x 16 x 64)
- `num_mini_batches = 16` (i.e. 16 gradient updates with slices of 512 samples)

If @edbeeching agrees with my logic, then indeed it would make sense to document (a) how the number of unique prompts is computed and (b) how mini-batching and gradient accumulation can be tuned to obtain large generation batch sizes.
@lewtun in general I agree, but I am not sure `gradient_accumulation_steps = 64` holds in the case of the DAPO example detailed above.
To summarize how I see it, for each generation step:
- There will be 16 optimization steps (gradient updates) in total.
- Each step is with 512 prompt-completion pairs.
- The total generation batch size is 512 * 16 = 8192.
- `per_device_train_batch_size` is unknown, but let's assume it is 16 and `num_gpus = 8`.
- Which would mean `per_device_train_batch_size=16` and `gradient_accumulation_steps=4`, as 16 * 8 * 4 = 512.

In this case the `num_mini_batches` would be `generation_batch_size / (num_gpus * per_device_train_batch_size) = 64`.
This is also equal to `num_optimization_steps * grad_acc_steps`.
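A quick arithmetic check of these numbers (plain Python; the variable names are just for illustration):

```python
# Per-gradient-update batch, using the numbers from the comment above.
num_gpus = 8
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_optimization_steps = 16

step_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
assert step_batch_size == 512  # 512 prompt-completion pairs per gradient update

generation_batch_size = step_batch_size * num_optimization_steps
assert generation_batch_size == 8192  # 512 * 16

num_mini_batches = generation_batch_size // (num_gpus * per_device_train_batch_size)
assert num_mini_batches == 64  # == num_optimization_steps * gradient_accumulation_steps
```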
Ah yes, indeed I made a mistake in my above reasoning; I agree with @edbeeching.
We discussed offline and also agree it is confusing to have `num_mini_batches` affect the generation batch size.
LGTM! Another cool optimization!
For the record, it may be more intuitive for the user to set `steps_per_generation` directly (rather than setting `generation_batch_size` and then calculating `steps_per_generation`), but you're the one who's been playing with the code over the last few days, so you know better what's more intuitive. I'm happy with both, so feel free to merge either.
I personally prefer …
I think both are useful, so I will make it so the user can set either option, but not both.
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
This PR decouples the size of the effective batch from the gradient accumulation steps with a new argument `num_mini_batches`. By default, `num_mini_batches = gradient_accumulation_steps`.
A few caveats:
- `accumulated_local_batch` should be shuffled before adding it to the buffer. See PR 🎲 [GRPO] Shuffle mini batches #3391.
- When `num_iterations > 1`, the buffer indices should be permuted / reshuffled in order to ensure that samples are taken randomly.
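A hedged usage sketch of how this might look from the user side, assuming the merged API exposes `steps_per_generation` / `generation_batch_size` as discussed above (exact argument names and defaults may differ from the final code):

```python
from trl import GRPOConfig

# Hypothetical configuration: generate one large batch per rollout, then
# optimise it in smaller accumulated chunks without going OOM.
config = GRPOConfig(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    steps_per_generation=16,  # or generation_batch_size=...; set one, not both
)
```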