Description
Feature request
This feature proposal is to implement a two-sided clipping mechanism for the GRPO (Group Relative Policy Optimization) trainer. This modification addresses a potential stability issue in the standard GRPO formulation.
The proposed objective function, following the dual-clip formulation, is:

$$
\mathcal{L}(\theta) =
\begin{cases}
\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right) & \hat{A}_t \ge 0 \\
\max\left(\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right),\ \delta\,\hat{A}_t\right) & \hat{A}_t < 0
\end{cases}
$$

where $r_t(\theta) = \pi_\theta / \pi_{\theta_\text{old}}$ is the probability ratio.

This introduces a new hyperparameter, `delta` (δ), to `GRPOConfig`. This parameter caps the probability ratio for negative advantages.
The implementation involves:

- Adding `delta` to `trl/trainer/grpo_config.py`.
- Modifying `_compute_loss` in `trl/trainer/grpo_trainer.py` to use the new clipping logic.
- Adding a corresponding unit test in `tests/test_grpo_trainer.py`.
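As a rough sketch of the proposed clipping logic (a hypothetical standalone function, not the actual `_compute_loss` implementation; `epsilon` and `delta` defaults are illustrative):

```python
import torch

def two_sided_clip_loss(ratio, advantages, epsilon=0.2, delta=3.0):
    """Sketch of GRPO loss with two-sided clipping for negative advantages."""
    # Standard clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages,
    )
    # Two-sided clipping: when A < 0, additionally bound the objective from
    # below by delta * A, which caps the effective probability ratio at delta.
    two_sided = torch.max(surrogate, delta * advantages)
    objective = torch.where(advantages < 0, two_sided, surrogate)
    # Trainers minimize, so return the negated mean objective.
    return -objective.mean()
```

With a very large ratio and a negative advantage (e.g. `ratio=10.0`, `advantage=-1.0`), the standard surrogate would contribute an objective of `-10.0`, while the two-sided version caps it at `delta * A = -3.0`, bounding the size of the update.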
Motivation
The standard GRPO formulation can encounter stability issues, particularly when negative advantages (Â_t < 0) coincide with very large probability ratios (π_θ / π_θ_old). In such cases, the original clipping mechanism (which only applies when the ratio is too small for negative advantages) can lead to extremely large policy updates, potentially destabilizing the training process.
The proposed two-sided clipping mechanism aims to mitigate this by introducing an upper bound (`delta`) on the probability ratio when advantages are negative. This still allows significant updates but prevents the extreme changes that could harm training stability and robustness. The recommendation is to set `delta > 1 + epsilon` so that the upper bound only activates outside the standard clipping range.
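A small scalar check (with hypothetical values `epsilon=0.2`, `delta=1.5`) illustrating why `delta > 1 + epsilon` keeps the extra bound inactive inside the trust region:

```python
def objective(r, A, eps=0.2, delta=1.5):
    """Per-token two-sided clipped objective, scalar version for illustration."""
    # clip(r, 1 - eps, 1 + eps) written with min/max
    clipped_ratio = max(min(r, 1 + eps), 1 - eps)
    surrogate = min(r * A, clipped_ratio * A)
    # Extra lower bound applies only for negative advantages.
    return max(surrogate, delta * A) if A < 0 else surrogate

# Within the trust region (r <= 1 + eps), the delta bound changes nothing:
print(objective(1.2, -1.0))   # -1.2, same as the standard surrogate
# Far outside it, the objective is capped at delta * A instead of r * A:
print(objective(10.0, -1.0))  # -1.5 rather than -10.0
```

If instead `delta <= 1 + epsilon`, the `delta * A` bound would override the standard clipped surrogate even for ratios inside `[1 - epsilon, 1 + epsilon]`, defeating the original clipping behavior.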
Your contribution
Will open a PR