
Implement Two-Sided Clipping for GRPO Trainer #3435

@ucalyptus

Description
Feature request

This feature proposal is to implement a two-sided clipping mechanism for the GRPO (Group Relative Policy Optimization) trainer. This modification addresses a potential stability issue in the standard GRPO formulation.

The proposed objective function is:

$$
\mathcal{J}(\theta) = \mathbb{E}\left[\min\left(\min\!\big(r_t(\theta),\,\delta\big)\,\hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad
r_t(\theta)=\frac{\pi_\theta}{\pi_{\theta_{\text{old}}}}
$$

This introduces a new hyperparameter, delta (δ), in GRPOConfig, which caps the probability ratio when the advantage is negative.
The implementation involves:

  1. Adding delta to trl/trainer/grpo_config.py.
  2. Modifying _compute_loss in trl/trainer/grpo_trainer.py to use this new clipping logic.
  3. Adding a corresponding unit test in tests/test_grpo_trainer.py.
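A minimal sketch of the clipping logic described in step 2, written as a standalone function (the function name and signature are illustrative, not the actual `_compute_loss` interface; `epsilon` defaults to a typical PPO-style value):

```python
import torch

def two_sided_grpo_loss(logps, old_logps, advantages, epsilon=0.2, delta=None):
    """Per-token GRPO loss with an optional upper cap (delta) on the ratio.

    When delta is None, this reduces to the standard one-sided GRPO clipping.
    """
    ratio = torch.exp(logps - old_logps)  # pi_theta / pi_theta_old
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    if delta is not None:
        # Cap the unclipped ratio from above. With delta > 1 + epsilon,
        # this only changes the loss when the advantage is negative.
        ratio = torch.clamp(ratio, max=delta)
    loss_unclipped = ratio * advantages
    loss_clipped = clipped_ratio * advantages
    return -torch.min(loss_unclipped, loss_clipped)
```

Because the `min` already selects the clipped term for positive advantages whenever the ratio exceeds `1 + epsilon`, the extra cap at `delta` is only binding for negative advantages, which is exactly the case the proposal targets.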

Motivation

The standard GRPO formulation can encounter stability issues, particularly when negative advantages (Â_t < 0) coincide with very large probability ratios (π_θ / π_θ_old). In such cases, the original clipping mechanism, which for negative advantages only binds when the ratio is too small, leaves the loss unbounded as the ratio grows, allowing extremely large policy updates that can destabilize training.

The proposed two-sided clipping mechanism aims to mitigate this by introducing an upper bound (delta) on the probability ratio when advantages are negative. This allows for significant updates but prevents the extreme changes that could harm training stability and robustness. The recommendation is to set delta > 1 + epsilon to ensure this balance.
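A worked numeric example of the effect described above (the specific values are illustrative, not from the issue):

```python
# A token with a negative advantage whose probability ratio has grown large.
adv = -2.0
ratio = 10.0            # pi_theta / pi_theta_old
eps, delta = 0.2, 1.5   # delta > 1 + eps, per the recommendation

def clip(x, lo, hi):
    return max(lo, min(x, hi))

# Standard GRPO: min picks the unclipped term, so the loss magnitude
# grows without bound as the ratio grows.
standard = min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)  # -20.0

# Two-sided clipping: the ratio is first capped at delta, bounding the update.
two_sided = min(min(ratio, delta) * adv, clip(ratio, 1 - eps, 1 + eps) * adv)  # -3.0
```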

Your contribution

Will open a PR
