Add entropy based filtering inside the GRPOTrainer. #3563
Conversation
You should be able to calculate the entropy directly from log probs as […], which means we don't have to modify […].
Also just realized that the temperature parameter won't affect the ranking order of entropy since all positions are affected by the same temperature, so the temp param shouldn't matter here.
Actually this isn't true, the […]
yes of course, my bad. not really a fan of the proposed refactor of […]
Nice! Thanks! Another recommendation: one argument is probably enough.
- if self.filter_on_entropy:
+ if self.token_entropy_percentile_threshold < 1.0:
Also add that the recommended value is 0.2 in the documentation.
Cool, let me look into making the entropy calculation less memory intensive.
Updated the code to make sure that only a mini-batch of logits is materialized at any given point in time, and entropies for those mini-batches of logits are optionally calculated. The […]
nice work. left a few minor comments
Thanks for the review, Leon! Made the suggested changes.
@qgallouedec please take another look at the PR when you have the time.
I spent some time benchmarking different entropy calculations, scripts here. Long story short, I recommend: […]
This is basically what you already had, but importantly we sum inside the loop, which avoids materializing the [S, V] tensor. I observed no latency improvements when increasing […]
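For concreteness, here is a minimal sketch of a chunked entropy computation in that spirit (the function and argument names are illustrative, not the exact code in the PR): it iterates over sequence chunks and reduces over the vocabulary inside the loop, so the full [S, V] probability tensor is never materialized.

```python
import torch

def entropy_from_logits_chunked(logits: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
    """Per-position entropy of [S, V] logits, H = logsumexp(z) - sum(softmax(z) * z),
    computed chunk by chunk along the sequence dimension."""
    entropies = []
    for chunk in logits.split(chunk_size, dim=0):  # chunk: [<=chunk_size, V]
        logsumexp = torch.logsumexp(chunk, dim=-1)
        # Reduce over the vocabulary inside the loop so only one chunk of
        # probabilities is alive at a time.
        expected_logit = (torch.softmax(chunk, dim=-1) * chunk).sum(dim=-1)
        entropies.append(logsumexp - expected_logit)
    return torch.cat(entropies)
```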
@qgallouedec @LeonEricsson this PR has some overlap with my implementation of an entropy regularization loss in #3628. We need to sync on both, given that the two address entropy in different directions. The entropy regularization loss is proposed in an issue and is also officially implemented in verl.
@1485840691 I think that despite the overlap the two purposes are complementary, and once this PR is merged, including the entropy loss in the final loss should be fairly simple.
@LeonEricsson thanks for those cool benchmarks! I've updated the code to compute the sum inside the for loop as you've suggested. |
I have a comment on entropy from logits. Given that verl already provides an implementation (https://github.com/volcengine/verl/blob/9b7bb69ea3165b691cc908d7f3f2f14c4a65a59e/verl/utils/torch_functional.py#L150), why do we not re-use that? Sorry, I tried benchmarking it; indeed verl's implementation does not have better runtime performance. But given that the entropy-from-logits function is embedded inside the chunk loop of get_per_token_logps, why do we need to support chunking here?
And there is another question regarding the interaction between the entropy loss and the entropy mask: do we need to consider the entropy mask when computing the entropy loss? Currently the entropy loss is computed using the completion mask. @qgallouedec @LeonEricsson
I'm assuming you mean why we don't simply import the function from verl […]
I don't think the entropy loss should take the entropy mask into consideration. In the entropy masking paper linked above, their intention was to affect only the policy loss. So, similar to how the KL-div doesn't consider the entropy mask, I believe the entropy loss shouldn't either, especially if we want to reproduce the behavior of veRL.
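To make that interaction concrete, here is a rough sketch under those assumptions (all names are illustrative and not the actual GRPOTrainer code): the entropy-percentile filter gates only the policy term, while the entropy regularization term from #3628 uses the plain completion mask, mirroring how the KL penalty ignores the filter mask.

```python
import torch

def combine_losses(per_token_policy_loss: torch.Tensor,
                   per_token_entropy: torch.Tensor,
                   completion_mask: torch.Tensor,
                   entropy_filter_mask: torch.Tensor,
                   entropy_coeff: float = 0.0) -> torch.Tensor:
    """Illustrative only: the entropy-percentile filter gates the policy term,
    while the entropy regularization term averages over all completion tokens."""
    denom = completion_mask.sum().clamp(min=1.0)
    policy_loss = (per_token_policy_loss * completion_mask * entropy_filter_mask).sum() / denom
    entropy_bonus = (per_token_entropy * completion_mask).sum() / denom
    # Subtracting the bonus encourages higher entropy (exploration), as in the
    # entropy regularization discussed in #3628.
    return policy_loss - entropy_coeff * entropy_bonus
```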
I think we might follow verl and support both entropy_from_logits (https://github.com/volcengine/verl/blob/9b7bb69ea3165b691cc908d7f3f2f14c4a65a59e/verl/utils/torch_functional.py#L143) and entropy_from_logits_chunking (https://github.com/volcengine/verl/blob/9b7bb69ea3165b691cc908d7f3f2f14c4a65a59e/verl/utils/torch_functional.py#L150). I do not mean importing the verl lib for such a small util function, just copying the code and adding a comment. That said, I have benchmarked it, and verl's implementation does not have better performance.
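For reference, the non-chunked formulation being compared is essentially the following (a sketch modeled on the linked verl util; the exact code there may differ slightly):

```python
import torch

def entropy_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Non-chunked variant: same identity H = logsumexp(z) - sum(softmax(z) * z),
    applied to the full logits tensor at once (higher peak memory, same result)."""
    pd = torch.softmax(logits, dim=-1)
    return torch.logsumexp(logits, dim=-1) - (pd * logits).sum(dim=-1)
```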
Sorry, I missed this comment. I think the objective here is to reduce peak memory usage rather than optimize run-time, based on the comments from Leon and Quentin.
I agree that chunking isn't necessary here, but since it's defined as a general utility function and could be useful in future scenarios, it's nice to have; it provides a convenient way to trade memory for latency.
Yeah, […]
Some final comments, then I'm happy to merge.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
What does this PR do?
This PR relates to #3555, which proposes masking out the policy loss coming from completion tokens whose entropy falls below a given percentile threshold (i.e., keeping only the highest-entropy positions).
This idea was proposed by the Qwen team in their accompanying paper, Beyond the 80/20 Rule.
Key Proposals of the paper that guided the implementation
The key difference is the added indicator term, which zeroes out the policy loss for tokens whose entropy falls below the batch-level percentile threshold.
From the paper: […]
Entropy is calculated as usual via the formula $H_t = -\sum_{v} p_t(v) \log p_t(v)$, where $p_t = \mathrm{softmax}(z_t / T)$ is the model's token distribution at position $t$ under sampling temperature $T$.
The paper applies the entropy mask to the DAPO loss function in their experiments, but I think we can leave it to the user to decide which loss (i.e., GRPO, Dr. GRPO, or DAPO) to apply it to.
The paper finds that the best threshold is to keep the top-20% of tokens based on their entropy.
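A minimal sketch of how such a percentile-based mask could be built, assuming (as the recommended value of 0.2 and the top-20% finding suggest) that the threshold denotes the fraction of highest-entropy completion tokens to keep; the function and variable names are illustrative and the actual integration into GRPOTrainer may differ:

```python
import torch

def top_entropy_mask(entropies: torch.Tensor, completion_mask: torch.Tensor,
                     threshold: float = 0.2) -> torch.Tensor:
    """Keep only the `threshold` fraction of highest-entropy completion tokens.
    threshold=1.0 keeps everything, i.e. the filter is disabled."""
    if threshold >= 1.0:
        return completion_mask
    # Compute the quantile only over real (non-padded) completion tokens.
    valid_entropies = entropies[completion_mask.bool()]
    cutoff = torch.quantile(valid_entropies.float(), 1.0 - threshold)
    return (entropies >= cutoff).to(completion_mask.dtype) * completion_mask

# e.g. per_token_loss = per_token_loss * top_entropy_mask(entropies, completion_mask, 0.2)
```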
I didn't run the vLLM tests inside test_grpo_trainer.py since my machine/VM didn't have access to a GPU.
Fixes #3555
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.