🔭 [GRPO] Log advantages and fraction of samples with an std of zero #3502

Merged
merged 13 commits into from
May 27, 2025

Conversation

edbeeching
Collaborator

@edbeeching edbeeching commented May 27, 2025

What does this PR do?

This PR implements:

  • Logging of advantages in the wandb table.
  • Logging of the fraction of samples whose reward std is zero.
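As a rough illustration of the two logged quantities, here is a minimal sketch in plain Python. It is not the code from this PR: the helper name `group_advantages`, the `eps` value, and the use of the sample standard deviation are assumptions made for the example. It assumes rewards are laid out group by group, with `num_generations` completions per prompt.

```python
import math
from statistics import mean, stdev


def group_advantages(rewards, num_generations, eps=1e-4):
    """Hypothetical helper: per-sample advantages and zero-std fraction.

    `rewards` is a flat list ordered group by group, with `num_generations`
    completions per prompt. `eps` is an assumed stabiliser for the division.
    """
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")

    advantages = []
    zero_std_flags = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = stdev(group) if len(group) > 1 else 0.0
        # A group whose completions all received the same reward has std 0,
        # so every advantage in it is 0 and it carries no learning signal.
        is_zero = math.isclose(sigma, 0.0, abs_tol=1e-12)
        zero_std_flags.extend([is_zero] * len(group))
        advantages.extend((r - mu) / (sigma + eps) for r in group)

    frac_zero_std = sum(zero_std_flags) / len(zero_std_flags)
    return advantages, frac_zero_std


# Two groups of three completions: the first group has identical rewards.
rewards = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0]
advantages, frac_zero_std = group_advantages(rewards, num_generations=3)
print(advantages)
print(frac_zero_std)
```

A high zero-std fraction suggests that many prompts are either too easy or too hard for the current policy, since those groups contribute nothing to the gradient.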

@edbeeching edbeeching marked this pull request as ready for review May 27, 2025 11:48
Member

@lewtun lewtun left a comment

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec and others added 5 commits May 27, 2025 11:06
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
@qgallouedec
Member

Now it looks good!

[Image: Screenshot 2025-05-27 at 12 46 37]

@qgallouedec
Member

And the new metric:

[Image: Screenshot 2025-05-27 at 12 55 39]

import tempfile
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
import random

config_name = "standard_prompt_only"
dataset = load_dataset("trl-internal-testing/zen", config_name, split="train")

def random_reward(completions, **kwargs):
    return [random.choice([0.0, 1.0]) for _ in completions]

with tempfile.TemporaryDirectory() as tmp_dir:
    training_args = GRPOConfig(
        output_dir=tmp_dir,
        learning_rate=0.1,  # increase the learning rate to speed up the test
        per_device_train_batch_size=6,  # reduce the batch size to reduce memory usage
        num_generations=3,  # reduce the number of generations to reduce memory usage
        max_completion_length=8,  # reduce the completion length to reduce memory usage
        log_completions=True,
        logging_steps=3,
    )
    trainer = GRPOTrainer(
        model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
        reward_funcs=random_reward,
        args=training_args,
        train_dataset=dataset,
    )

    trainer.train()
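
For a sanity check on the script above, the expected value of the zero-std fraction can be worked out by enumeration. This is a sketch under two assumptions that the thread does not state: each completion's reward is drawn independently and uniformly from {0.0, 1.0}, as `random_reward` does, and the std is computed over the `num_generations=3` completions of one prompt.

```python
from itertools import product

num_generations = 3
reward_values = (0.0, 1.0)

# Enumerate every equally likely reward assignment for one group.
outcomes = list(product(reward_values, repeat=num_generations))

# A group has zero std exactly when all of its rewards are identical.
zero_std_outcomes = [o for o in outcomes if len(set(o)) == 1]

expected_frac_zero_std = len(zero_std_outcomes) / len(outcomes)
print(expected_frac_zero_std)  # 2 of 8 outcomes -> 0.25
```

With only a few groups per step, the logged value would be expected to fluctuate around this figure rather than match it exactly.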

@qgallouedec qgallouedec changed the title [GRPO] Log advantages and fraction of samples with an std of zero 🔭 [GRPO] Log advantages and fraction of samples with an std of zero May 27, 2025
@qgallouedec qgallouedec merged commit 0b6a187 into main May 27, 2025
11 checks passed
@qgallouedec qgallouedec deleted the grpo-log-adv branch May 27, 2025 19:58
qgallouedec added a commit that referenced this pull request May 27, 2025
4 participants