
The Entropy Mechanism of Reinforcement Learning for Large Language Model Reasoning


🎉News

  • [2025/06/20] 🎉 Our KL_Cov and Clip_Cov are merged into verl! You can now use our methods in verl main by setting loss_mode to "clip_cov" or "kl_cov"; an example script in verl can be found here.
  • [2025/06/04] We sent a PR to verl; you can use our approaches in verl via PR #1830.
  • [2025/06/03] 🎉 Ranked #3 of the week on Hugging Face Weekly Papers.
  • [2025/05/29] 🎉 Ranked #1 of the day on Hugging Face Daily Papers.
  • [2025/05/29] Released our paper on arXiv. See here. We provide insights into the entropy mechanism of RL for LLMs and propose two simple yet effective strategies to alleviate entropy collapse.

✨Getting started

This repo is forked from verl. We build our code on the DAPO recipe.

Note: For any training or test set, please modify the system prompt so that the final answer is given in \boxed{} format, since we extract the answer from \boxed{} for verification.
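
For intuition, answer extraction from \boxed{} typically looks like the sketch below; the function name and regex are our own illustration, not the repo's exact verifier.

import re

def extract_boxed_answer(completion: str):
    """Return the content of the last \\boxed{...} in a completion, or None.

    Handles one level of nested braces (enough for answers like
    \\boxed{\\frac{1}{2}}); deeper nesting would need a real parser.
    """
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", completion)
    return matches[-1] if matches else None

print(extract_boxed_answer(r"... so the answer is \boxed{\frac{1}{2}}."))  # -> \frac{1}{2}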

Our training and evaluation data can be found here.

Installation

You can install the dependencies by running the following command:

conda env create -n entropy -f environment.yaml

Training

Before training, make sure the AIME, AIME25, and AMC datasets have their "data_source" field set to "aime", "aime25", and "amc", respectively; we hardcode these names so that these sets are rolled out with a temperature of 0.6.
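
verl-style recipes typically read datasets from parquet files, so if yours are stored that way, a small pandas script like the one below can check or patch the "data_source" field. The file names here are placeholders; point them at your own evaluation files.

import pandas as pd

# Placeholder file names; point these at your own evaluation parquet files.
datasets = {
    "aime.parquet": "aime",
    "aime25.parquet": "aime25",
    "amc.parquet": "amc",
}

for path, source in datasets.items():
    df = pd.read_parquet(path)
    # Overwrite (or create) the data_source column so the recipe's hardcoded
    # check picks these sets up and rolls them out at temperature 0.6.
    df["data_source"] = source
    df.to_parquet(path, index=False)
    print(f"{path}: data_source set to '{source}' for {len(df)} rows")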

To train Qwen2.5-7B on a single node, taking the KL-Cov approach as an example, simply run:

cd Entropy-Mechanism-of-RL
conda activate entropy
bash recipe/dapo/7b_kl_cov.sh

To train Qwen2.5-32B on multiple nodes, run:

cd Entropy-Mechanism-of-RL
conda activate entropy
bash recipe/dapo/32b_kl_cov.sh

If you encounter issues starting Ray across multiple nodes, you can try this alternative:

export WANDB_API_KEY=YOUR_WANDB_KEY
source /your/path/to/miniconda3/etc/profile.d/conda.sh
conda activate entropy
cd Entropy-Mechanism-of-RL
python recipe/dapo/example_run_on_nodes.py

📖Introduction


This paper addresses the entropy collapse issue in scaling reinforcement learning (RL) for large language models (LLMs), where policy entropy drops sharply during training, leading to overconfidence and performance saturation. We empirically establish a relationship between entropy ($H$) and performance ($R$): $R = -a\exp(H) + b$, showing performance is bottlenecked by entropy exhaustion.
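
For intuition, a fit of this form can be reproduced with scipy.optimize.curve_fit; the numbers below are placeholder (entropy, accuracy) pairs, not the paper's measurements.

import numpy as np
from scipy.optimize import curve_fit

def performance_from_entropy(H, a, b):
    # Empirical fit from the paper: R = -a * exp(H) + b
    return -a * np.exp(H) + b

# Placeholder (entropy, accuracy) pairs standing in for real training logs.
H = np.array([0.05, 0.10, 0.20, 0.35, 0.50, 0.80])
R = np.array([0.46, 0.44, 0.41, 0.36, 0.31, 0.20])

(a, b), _ = curve_fit(performance_from_entropy, H, R, p0=(0.1, 0.5))
print(f"fitted a = {a:.3f}, b = {b:.3f}")
# With entropy fully consumed (H -> 0), the fit predicts a ceiling of b - a,
# which is what "performance bottlenecked by entropy exhaustion" refers to.
print(f"predicted ceiling R(H=0) = {b - a:.3f}")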


Theoretically, we find that entropy changes are driven by the covariance between action probability and logit updates, which correlates with advantage in Policy Gradient methods. High-probability, high-advantage actions reduce entropy, while rare, high-advantage actions increase it. Empirically, the covariance term remains positive, explaining entropy's monotonic decline. To mitigate this, we propose Clip-Cov and KL-Cov, which restrict updates for high-covariance tokens. These methods effectively prevent entropy collapse and improve performance.
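
A rough sketch of both ideas in PyTorch is given below; the tensor shapes, the top fraction, and the KL coefficient are placeholders, and this is our simplified reading rather than the exact verl implementation. The idea: compute the per-token covariance term, select the tokens where it is largest, and either exclude them from the policy-gradient update (Clip-Cov) or add a KL penalty toward the rollout policy on them (KL-Cov).

import torch

def high_cov_mask(log_probs, advantages, top_frac=0.002):
    # Per-token contribution to the covariance that drives entropy down:
    # centered log-probability times centered advantage.
    cov = (log_probs - log_probs.mean()) * (advantages - advantages.mean())
    k = max(1, int(top_frac * cov.numel()))
    threshold = torch.topk(cov.flatten(), k).values.min()
    return cov >= threshold  # True on the high-covariance tokens

def policy_loss_kl_cov(log_probs, old_log_probs, advantages, kl_coef=1.0):
    # Plain per-token policy-gradient loss.
    pg_loss = -advantages * log_probs
    mask = high_cov_mask(log_probs.detach(), advantages)
    # KL-Cov: penalize divergence from the rollout policy on flagged tokens only
    # (here a simple |log pi - log pi_old| proxy stands in for the KL term).
    kl_penalty = (log_probs - old_log_probs).abs()
    loss = torch.where(mask, pg_loss + kl_coef * kl_penalty, pg_loss)
    # Clip-Cov would instead drop the flagged tokens from the gradient,
    # e.g. loss = torch.where(mask, pg_loss.detach(), pg_loss).
    return loss.mean()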

📃Evaluation


Our method is able to maintain a considerably higher level of entropy throughout training. For example, when the baseline's entropy reaches a plateau and can no longer be consumed, the KL-Cov method still sustains an entropy level over 10 times higher. Meanwhile, the response length of the policy model steadily increases, and its performance on the test set consistently surpasses that of the baseline. This indicates that our model is able to explore more freely during training, learning a better policy through RL.


Both of our approaches achieve non-trivial improvements across all benchmarks. Our method outperforms GRPO by 2.0% on average for the 7B model and by 6.4% for the 32B model. Moreover, we observe that the gains are more substantial on the larger Qwen2.5-32B: specifically, our method improves over GRPO by 15.0% and 14.6% on the most challenging benchmarks, AIME24 and AIME25, respectively.

🎈Citation

If you find this paper or repo helpful, please cite us.

@article{cui2025entropy,
  title={The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models},
  author={Cui, Ganqu and Zhang, Yuchen and Chen, Jiacheng and Yuan, Lifan and Wang, Zhi and Zuo, Yuxin and Li, Haozhan and Fan, Yuchen and Chen, Huayu and Chen, Weize and others},
  journal={arXiv preprint arXiv:2505.22617},
  year={2025}
}

🌻Acknowledgement

We implement our reinforcement learning algorithms by extending veRL and use vLLM for inference. Our models are primarily based on the Qwen2.5 family, and our training data is built from DAPO-MATH. Thanks for their great contributions!

📬 Contact

For questions, discussion, or collaboration opportunities, feel free to reach out to the authors.

