
Add Entropy Control to GRPOTrainer #3628


Open · wants to merge 61 commits into base: main

Conversation

1485840691
Contributor

What does this PR do?

Fixes #3320

The initial step is to support static entropy control.
The next step is to support adaptive entropy control.
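
For context, static entropy control here means adding an entropy bonus with a fixed coefficient to the loss; roughly (a minimal sketch with illustrative names, not the final API):

# Minimal sketch of static entropy control (illustrative names, not the final API).
# per_token_entropy: entropy of the policy distribution per token, shape (batch, seq)
# completion_mask:   1 for completion tokens, 0 for padding
entropy_loss = (per_token_entropy * completion_mask).sum() / completion_mask.sum()

# With static control the coefficient is a fixed config value; with adaptive
# control it would instead be updated on each optimizer step.
loss = loss - entropy_coef * entropy_loss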

Before submitting

  • [N] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [Y] Did you read the contributor guideline, Pull Request section?
  • [Y] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • [Y] Did you make sure to update the documentation with your changes?
  • [N] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@1485840691 1485840691 marked this pull request as draft June 22, 2025 08:39
@LeonEricsson
Collaborator

LeonEricsson commented Jun 22, 2025

Note that there is a parallel PR (#3563) working on entropy-based filtering; we're going to need to sync these.

@1485840691 1485840691 reopened this Jun 27, 2025
@1485840691 1485840691 marked this pull request as ready for review June 29, 2025 09:22
@1485840691
Contributor Author

While reviewing the updated entropy controller I noted the following issues, which I should have realized sooner; apologies for that.

  1. Hidden mutable state
    The class keeps an internal entropy coefficient that mutates on every call. Because that state lives outside the trainer/optimizer stack, it's easy to miss in tests, logs, or checkpoints, and it makes debugging non-deterministic behaviour harder.
  2. Distributed training
    Right now every rank updates the coefficient from its local entropy, so the values drift apart. That means different GPUs are optimising slightly different objectives, whereas the paper intends a single global coefficient.

I suggest moving ownership of the entropy coefficient to GRPOTrainer and making the entropy controller a pure strategy object that only holds the logic to step the coefficient: rename __call__ to step(), rename the class to EntropyScheduler, step the coefficient from the global entropy, and broadcast the result to all ranks. Something like this (reduce() and broadcast() are placeholders):

# Aggregate the per-token entropy into a scalar entropy loss
entropy_loss = agg_loss(...)

# Reduce to a single global entropy across all ranks (placeholder)
world_entropy = reduce(entropy_loss.detach(), reduction="mean")

# Only the main process steps the coefficient...
if self.accelerator.is_main_process:
    self.entropy_coef = self.entropy_scheduler.step(
        self.entropy_coef, world_entropy
    )

# ...then the updated value is broadcast to all ranks (placeholder)
broadcast(self.entropy_coef, src=0)

loss = loss - self.entropy_coef * entropy_loss
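
For concreteness, the strategy object could look roughly like this (a sketch only; the target/delta update rule is a placeholder to illustrate the step() interface, not necessarily the exact rule from the paper):

class EntropyScheduler:
    """Pure strategy: computes the next coefficient, keeps no mutable state."""

    def __init__(self, target_entropy, delta, min_coef=0.0, max_coef=1.0):
        self.target_entropy = target_entropy  # entropy level we want to maintain
        self.delta = delta                    # per-step adjustment of the coefficient
        self.min_coef = min_coef
        self.max_coef = max_coef

    def step(self, coef, entropy):
        # Raise the bonus when entropy falls below the target, lower it otherwise.
        coef = coef + self.delta if entropy < self.target_entropy else coef - self.delta
        return min(max(coef, self.min_coef), self.max_coef)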

Yes, I also think it might be better to use a global scheduler that updates the entropy coef based on the global entropy loss gathered from all ranks. I took a look at the original Skywork code and think it might be using a per-rank scheduler to control the entropy coef. If you have time, could you please help confirm that?

The entropy loss applies the entropy coef: https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/workers/actor/dp_actor.py#L234

The entropy controller is defined inside the trainer:

https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/trainer/ppo/ray_trainer.py#L391

https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/trainer/ppo/ray_trainer.py#L1097C25-L1098C1

@LeonEricsson
Collaborator

LeonEricsson commented Jul 29, 2025

@qgallouedec would appreciate your thoughts on dealing with the stateful entropy coefficient. To recap, Adaptive Entropy Control maintains the entropy coefficient $\alpha_k$ as an adaptive (or running) coefficient, which is incrementally updated on each optimizer step based on the batch's entropy. Is something like this sufficient for maintaining a global entropy coefficient?

# Aggregate the per-token entropy into a scalar
entropy = agg_loss(...)

# Single global entropy across all ranks (placeholder)
world_entropy = all_reduce(entropy.detach(), reduction="mean")

# Every rank performs the same update from the same global value
self.entropy_coef = self.entropy_scheduler.step(
    self.entropy_coef, world_entropy
)

loss = loss - self.entropy_coef * entropy
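
If this ends up going through Accelerate, the all_reduce placeholder maps naturally onto the accelerator the trainer already has (a sketch; assumes entropy is a scalar tensor on the accelerator's device):

# Sketch: concrete form of the placeholder using Accelerate's reduce.
world_entropy = self.accelerator.reduce(entropy.detach(), reduction="mean")

# Every rank applies the same deterministic step() to the same global value,
# so the coefficient stays identical across ranks without an explicit broadcast.
self.entropy_coef = self.entropy_scheduler.step(self.entropy_coef, world_entropy.item())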

@1485840691
Contributor Author

1485840691 commented Jul 31, 2025

@qgallouedec would appreciate your thoughts on dealing with the stateful entropy coefficient. To recap, Adaptive Entropy Control maintains the entropy coefficient $\alpha_k$ as an adaptive (or running) coefficient, which is incrementally updated on each optimizer step based on the batch's entropy. Is something like this sufficient for maintaining a global entropy coefficient?

entropy = agg_loss(...)

world_entropy = reduce(entropy.detach(), reduction="mean")

if self.accelerator.is_main_process:
    self.entropy_coef = self.entropy_scheduler.step(
        self.entropy_coef, world_entropy
    )                   

broadcast(self.entropy_coef)

loss = loss - self.entropy_coef * entropy

8c08682

I have submitted an initial update

My comment:

  • Could we use only a global entropy loss gathered from all ranks?
  • In each rank, the process keeps its own entropy coef state and updates it based on the global entropy loss.
  • Any potential risk in applying this simpler update?

@LeonEricsson what do you think of my suggested approach of using only the global entropy loss but keeping a per-rank entropy scheduler? The entropy coef computation would then run on every rank instead of being blocked on the main process. Besides, since each rank steps the coef from the same global entropy loss, the coef on each rank should follow essentially the same update path.
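
If it helps settle the drift concern, a cheap sanity check during development could gather the per-rank coefficients and assert they agree (a sketch; assumes the usual self.accelerator):

import torch

# Sketch: verify the per-rank entropy coefficients have not drifted apart.
coef = torch.tensor([self.entropy_coef], device=self.accelerator.device)
gathered = self.accelerator.gather(coef)  # shape: (num_processes,)
assert torch.allclose(gathered, gathered[0].expand_as(gathered)), "entropy_coef drifted across ranks"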

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -185,6 +185,48 @@ def __len__(self) -> int:
return (self.num_samples // self.batch_size) * self.batch_size * self.mini_repeat_count * self.repeat_count


class EntropyController:
Member


I don't think this controller is needed. Maybe it's simpler with only attributes in GRPO?
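
For illustration, folding the controller into plain trainer attributes could look roughly like this (a sketch; the attribute/config names are assumptions, not the actual implementation):

# In GRPOTrainer.__init__ (illustrative names):
self.entropy_coef = args.entropy_coef            # current coefficient
self.entropy_target = args.entropy_target        # target entropy (None = static control)
self.entropy_coef_delta = args.entropy_coef_delta

# In the loss computation (world_entropy / entropy_loss computed as in the
# snippets above), the controller reduces to a few lines:
if self.entropy_target is not None:
    if world_entropy < self.entropy_target:
        self.entropy_coef += self.entropy_coef_delta
    else:
        self.entropy_coef -= self.entropy_coef_delta
loss = loss - self.entropy_coef * entropy_loss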
