
Add Entropy Control to GRPOTrainer #3628


Open · wants to merge 61 commits into base: main

Conversation

1485840691
Contributor

What does this PR do?

Fixes #3320

The initial step is to support static entropy control.
The next step is to support adaptive entropy control.
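
For context, static entropy control here means adding an entropy bonus with a fixed coefficient to the loss; roughly (a minimal sketch with illustrative names, not the final API):

# Minimal sketch of static entropy control (illustrative names, not the final API).
# per_token_entropy: entropy of the policy distribution per token, shape (batch, seq)
# completion_mask:   1 for completion tokens, 0 for padding
entropy_loss = (per_token_entropy * completion_mask).sum() / completion_mask.sum()

# With static control the coefficient is a fixed config value; with adaptive
# control it would instead be updated on each optimizer step.
loss = loss - entropy_coef * entropy_loss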

Before submitting

  • [N] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [Y] Did you read the contributor guideline, Pull Request section?
  • [Y] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • [Y] Did you make sure to update the documentation with your changes?
  • [N] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@1485840691 1485840691 marked this pull request as draft June 22, 2025 08:39
@LeonEricsson
Collaborator

LeonEricsson commented Jun 22, 2025

Note that there is a parallel PR (#3563) working on entropy-based filtering; we're going to need to sync these.

@1485840691 1485840691 reopened this Jun 27, 2025
@1485840691 1485840691 marked this pull request as ready for review June 29, 2025 09:22
@1485840691
Contributor Author

While reviewing the updated entropy controller I noted the following issues, which I should have realized sooner; apologies for that.

  1. Hidden mutable state
    The class keeps an internal entropy coefficient that mutates on every call. Because that state lives outside the trainer/optimizer stack, it's easy to miss in tests, logs, or checkpoints, and it makes debugging non-deterministic behaviour harder.
  2. Distributed training
    Right now every rank updates the coefficient from its local entropy, so the values drift apart. That means different GPUs are optimising slightly different objectives, whereas the paper intends a single global coefficient.

I suggest moving ownership of the entropy coefficient to GRPOTrainer and making the entropy controller a pure strategy object that only holds the logic to step the coefficient: rename __call__ to step(), rename the class to EntropyScheduler, step the coefficient from the global entropy, and broadcast the result to all ranks. Something like this (reduce() and broadcast() are placeholders):

# Aggregate the per-token entropy into a scalar entropy loss
entropy_loss = agg_loss(...)

# Reduce to a single global entropy across all ranks (placeholder)
world_entropy = reduce(entropy_loss.detach(), reduction="mean")

# Only the main process steps the coefficient...
if self.accelerator.is_main_process:
    self.entropy_coef = self.entropy_scheduler.step(
        self.entropy_coef, world_entropy
    )

# ...then the updated value is broadcast to all ranks (placeholder)
broadcast(self.entropy_coef, src=0)

loss = loss - self.entropy_coef * entropy_loss
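
For concreteness, the strategy object could look roughly like this (a sketch only; the target/delta update rule is a placeholder to illustrate the step() interface, not necessarily the exact rule from the paper):

class EntropyScheduler:
    """Pure strategy: computes the next coefficient, keeps no mutable state."""

    def __init__(self, target_entropy, delta, min_coef=0.0, max_coef=1.0):
        self.target_entropy = target_entropy  # entropy level we want to maintain
        self.delta = delta                    # per-step adjustment of the coefficient
        self.min_coef = min_coef
        self.max_coef = max_coef

    def step(self, coef, entropy):
        # Raise the bonus when entropy falls below the target, lower it otherwise.
        coef = coef + self.delta if entropy < self.target_entropy else coef - self.delta
        return min(max(coef, self.min_coef), self.max_coef)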

Yes, I also think it might be better to use a global scheduler that updates the entropy coef based on the global entropy loss gathered from all ranks. I took a look at the original Skywork code and think it might be using a per-rank scheduler to control the entropy coef. If you have time, could you please help confirm that?

The entropy loss applies the entropy coef: https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/workers/actor/dp_actor.py#L234

The entropy controller is defined inside the trainer:

https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/trainer/ppo/ray_trainer.py#L391

https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/trainer/ppo/ray_trainer.py#L1097C25-L1098C1

@LeonEricsson
Collaborator

LeonEricsson commented Jul 29, 2025

@qgallouedec would appreciate your thoughts on dealing with the stateful entropy coefficient. To recap, Adaptive Entropy Control maintains the entropy coefficient $\alpha_k$ as an adaptive (or running) coefficient, which is incrementally updated on each optimizer step based on the batch's entropy. Is something like this sufficient for maintaining a global entropy coefficient?

# Aggregate the per-token entropy into a scalar
entropy = agg_loss(...)

# Single global entropy across all ranks (placeholder)
world_entropy = all_reduce(entropy.detach(), reduction="mean")

# Every rank performs the same update from the same global value
self.entropy_coef = self.entropy_scheduler.step(
    self.entropy_coef, world_entropy
)

loss = loss - self.entropy_coef * entropy
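
If this ends up going through Accelerate, the all_reduce placeholder maps naturally onto the accelerator the trainer already has (a sketch; assumes entropy is a scalar tensor on the accelerator's device):

# Sketch: concrete form of the placeholder using Accelerate's reduce.
world_entropy = self.accelerator.reduce(entropy.detach(), reduction="mean")

# Every rank applies the same deterministic step() to the same global value,
# so the coefficient stays identical across ranks without an explicit broadcast.
self.entropy_coef = self.entropy_scheduler.step(self.entropy_coef, world_entropy.item())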

@1485840691
Contributor Author

1485840691 commented Jul 31, 2025

@qgallouedec would appreciate your thoughts on dealing with the stateful entropy coefficient. To recap, Adaptive Entropy Control maintains the entropy coefficient $\alpha_k$ as an adaptive (or running) coefficient, which is incrementally updated on each optimizer step based on the batch's entropy. Is something like this sufficient for maintaining a global entropy coefficient?

entropy = agg_loss(...)

world_entropy = reduce(entropy.detach(), reduction="mean")

if self.accelerator.is_main_process:
    self.entropy_coef = self.entropy_scheduler.step(
        self.entropy_coef, world_entropy
    )                   

broadcast(self.entropy_coef)

loss = loss - self.entropy_coef * entropy

8c08682

I have submitted an initial update

My comment:

  • Could we use only a global entropy loss gathered from all ranks?
  • In each rank, the process keeps its own entropy coef state and updates it based on the global entropy loss.
  • Any potential risk in applying this simpler update?

@LeonEricsson what do you think of my suggested approach of using only the global entropy loss but keeping a per-rank entropy scheduler? The entropy coef computation would then run on every rank instead of being blocked on the main process. Besides, since each rank steps the coef from the same global entropy loss, the coef on each rank should follow essentially the same update path.
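
If it helps settle the drift concern, a cheap sanity check during development could gather the per-rank coefficients and assert they agree (a sketch; assumes the usual self.accelerator):

import torch

# Sketch: verify the per-rank entropy coefficients have not drifted apart.
coef = torch.tensor([self.entropy_coef], device=self.accelerator.device)
gathered = self.accelerator.gather(coef)  # shape: (num_processes,)
assert torch.allclose(gathered, gathered[0].expand_as(gathered)), "entropy_coef drifted across ranks"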

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -185,6 +185,48 @@ def __len__(self) -> int:
return (self.num_samples // self.batch_size) * self.batch_size * self.mini_repeat_count * self.repeat_count


class EntropyController:
Member


I don't think this controller is needed. Maybe it's simpler with only attributes in GRPO?
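
For illustration, folding the controller into plain trainer attributes could look roughly like this (a sketch; the attribute/config names are assumptions, not the actual implementation):

# In GRPOTrainer.__init__ (illustrative names):
self.entropy_coef = args.entropy_coef            # current coefficient
self.entropy_target = args.entropy_target        # target entropy (None = static control)
self.entropy_coef_delta = args.entropy_coef_delta

# In the loss computation (world_entropy / entropy_loss computed as in the
# snippets above), the controller reduces to a few lines:
if self.entropy_target is not None:
    if world_entropy < self.entropy_target:
        self.entropy_coef += self.entropy_coef_delta
    else:
        self.entropy_coef -= self.entropy_coef_delta
loss = loss - self.entropy_coef * entropy_loss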
