Description
Feature request
This feature proposal is to implement a two-sided clipping mechanism for the GRPO (Group Relative Policy Optimization) trainer. This modification addresses a potential stability issue in the standard GRPO formulation.
The proposed objective function, following the dual-clip formulation, is:

$$
\mathcal{L}(\theta) =
\begin{cases}
\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right) & \hat{A}_t \ge 0 \\
\max\left(\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right),\ \delta\,\hat{A}_t\right) & \hat{A}_t < 0
\end{cases}
$$

where $r_t(\theta) = \pi_\theta / \pi_{\theta_\text{old}}$ is the probability ratio.

This introduces a new hyperparameter, `delta` (δ), to `GRPOConfig`. This parameter caps the probability ratio for negative advantages.
The implementation involves:

- Adding `delta` to `trl/trainer/grpo_config.py`.
- Modifying `_compute_loss` in `trl/trainer/grpo_trainer.py` to use the new clipping logic.
- Adding a corresponding unit test in `tests/test_grpo_trainer.py`.
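As a rough sketch of the proposed clipping logic (a hypothetical standalone function, not the actual `_compute_loss` implementation; `epsilon` and `delta` defaults are illustrative):

```python
import torch

def two_sided_clip_loss(ratio, advantages, epsilon=0.2, delta=3.0):
    """Sketch of GRPO loss with two-sided clipping for negative advantages."""
    # Standard clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages,
    )
    # Two-sided clipping: when A < 0, additionally bound the objective from
    # below by delta * A, which caps the effective probability ratio at delta.
    two_sided = torch.max(surrogate, delta * advantages)
    objective = torch.where(advantages < 0, two_sided, surrogate)
    # Trainers minimize, so return the negated mean objective.
    return -objective.mean()
```

With a very large ratio and a negative advantage (e.g. `ratio=10.0`, `advantage=-1.0`), the standard surrogate would contribute an objective of `-10.0`, while the two-sided version caps it at `delta * A = -3.0`, bounding the size of the update.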
Motivation
The standard GRPO formulation can encounter stability issues, particularly when negative advantages (Â_t < 0) coincide with very large probability ratios (π_θ / π_θ_old). In such cases, the original clipping mechanism (which only applies when the ratio is too small for negative advantages) can lead to extremely large policy updates, potentially destabilizing the training process.
The proposed two-sided clipping mechanism aims to mitigate this by introducing an upper bound (`delta`) on the probability ratio when advantages are negative. This still allows significant updates but prevents the extreme changes that could harm training stability and robustness. The recommendation is to set `delta > 1 + epsilon` so that the upper bound only activates outside the standard clipping range.
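A small scalar check (with hypothetical values `epsilon=0.2`, `delta=1.5`) illustrating why `delta > 1 + epsilon` keeps the extra bound inactive inside the trust region:

```python
def objective(r, A, eps=0.2, delta=1.5):
    """Per-token two-sided clipped objective, scalar version for illustration."""
    # clip(r, 1 - eps, 1 + eps) written with min/max
    clipped_ratio = max(min(r, 1 + eps), 1 - eps)
    surrogate = min(r * A, clipped_ratio * A)
    # Extra lower bound applies only for negative advantages.
    return max(surrogate, delta * A) if A < 0 else surrogate

# Within the trust region (r <= 1 + eps), the delta bound changes nothing:
print(objective(1.2, -1.0))   # -1.2, same as the standard surrogate
# Far outside it, the objective is capped at delta * A instead of r * A:
print(objective(10.0, -1.0))  # -1.5 rather than -10.0
```

If instead `delta <= 1 + epsilon`, the `delta * A` bound would override the standard clipped surrogate even for ratios inside `[1 - epsilon, 1 + epsilon]`, defeating the original clipping behavior.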
Your contribution
Will open a PR