🔭 [GRPO] Log advantages and fraction of samples with an std of zero #3502

edbeeching · 2025-05-27T11:48:35Z

What does this PR do?

This PR implements:

Logging of advantages in the wandb table.
Logging of the fraction of samples whose reward std is zero.
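
For readers unfamiliar with the two quantities, here is a rough sketch of how group-relative advantages and the zero-std fraction could be computed from a flat list of rewards. This is not the code in trl/trainer/grpo_trainer.py: the function name `group_advantages`, the `eps` value, the tolerance, and the use of the sample standard deviation are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def group_advantages(rewards, num_generations, eps=1e-4):
    """Hypothetical helper: normalize rewards within each group of completions.

    `rewards` is a flat list where every consecutive block of
    `num_generations` entries belongs to the same prompt.
    Returns (advantages, fraction of samples whose group reward std is zero).
    """
    if not rewards or len(rewards) % num_generations != 0:
        raise ValueError("rewards must be a non-empty multiple of num_generations")

    advantages = []
    zero_std_flags = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        # Assumption: sample std (ddof=1); a single-sample group has no spread.
        sigma = stdev(group) if len(group) > 1 else 0.0
        advantages.extend((r - mu) / (sigma + eps) for r in group)
        # Every sample in a zero-std group gets a zero advantage, so it
        # contributes no learning signal; flag each of them.
        is_zero = math.isclose(sigma, 0.0, abs_tol=1e-8)
        zero_std_flags.extend([is_zero] * len(group))

    frac_zero_std = sum(zero_std_flags) / len(zero_std_flags)
    return advantages, frac_zero_std


# Two prompts, three completions each: the first group is constant.
advantages, frac_zero_std = group_advantages([1.0, 1.0, 1.0, 0.0, 1.0, 0.0], 3)
```

In this example the first group has identical rewards, so its three advantages are zero and half of the samples fall into a zero-std group.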

lewtun

LGTM with a nit and once the docs are updated too: https://github.com/huggingface/trl/blob/main/docs/source/grpo_trainer.md#logged-metrics

trl/trainer/grpo_trainer.py


HuggingFaceDocBuilderDev · 2025-05-27T12:54:24Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

trl/trainer/grpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

qgallouedec · 2025-05-27T19:47:07Z

Now it looks good!

qgallouedec · 2025-05-27T19:56:55Z

And the new metric:

import random
import tempfile

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

config_name = "standard_prompt_only"
dataset = load_dataset("trl-internal-testing/zen", config_name, split="train")

def random_reward(completions, **kwargs):
    # Random 0/1 reward per completion, so some groups end up with a reward std of zero
    return [random.choice([0.0, 1.0]) for _ in completions]

with tempfile.TemporaryDirectory() as tmp_dir:
    training_args = GRPOConfig(
        output_dir=tmp_dir,
        learning_rate=0.1,  # increase the learning rate to speed up the test
        per_device_train_batch_size=6,  # reduce the batch size to reduce memory usage
        num_generations=3,  # reduce the number of generations to reduce memory usage
        max_completion_length=8,  # reduce the completion length to reduce memory usage
        log_completions=True,
        logging_steps=3,
    )
    trainer = GRPOTrainer(
        model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
        reward_funcs=random_reward,
        args=training_args,
        train_dataset=dataset,
    )

    trainer.train()
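
As a sanity check on what the script above should report, the chance that a group of completions gets identical rewards can be enumerated directly. The sketch below assumes rewards are drawn independently and uniformly, as `random.choice` does; the helper name `zero_std_group_probability` is made up for this illustration.

```python
from itertools import product


def zero_std_group_probability(reward_values, num_generations):
    """Hypothetical helper: probability that one group has zero reward std.

    Assumes each completion's reward is drawn independently and uniformly
    from `reward_values`.
    """
    outcomes = list(product(reward_values, repeat=num_generations))
    # A group has zero std exactly when all of its rewards are equal.
    constant = [o for o in outcomes if len(set(o)) == 1]
    return len(constant) / len(outcomes)


# Same setup as the script: rewards in {0.0, 1.0}, num_generations=3.
expected_fraction = zero_std_group_probability([0.0, 1.0], 3)
```

With two reward values and three generations, 2 of the 8 equally likely outcomes are constant, so the logged fraction should hover around 0.25 under these assumptions.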

edbeeching and others added 4 commits May 27, 2025 11:03

add advantage logging

914dfef

add frac zero std reporting

bde1070

precommit

ab8b8e1

Merge branch 'main' into grpo-log-adv

96b9f8b

edbeeching marked this pull request as ready for review May 27, 2025 11:48

edbeeching requested review from lewtun, qgallouedec and shirinyamani May 27, 2025 11:49

lewtun approved these changes May 27, 2025
