⚖️ Add option not to scale rewards (Dr. GRPO) #3135

qgallouedec · 2025-03-22T19:11:38Z


HuggingFaceDocBuilderDev · 2025-03-22T19:15:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zaddy6 · 2025-03-22T19:51:45Z

Using scale_rewards=False throws this error:

[rank5]: Traceback (most recent call last):
[rank5]:   File "/workspace/never_peft.py", line 900, in <module>
[rank5]:     trainer.train()
[rank5]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank5]:     return inner_training_loop(
[rank5]:            ^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
[rank5]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank5]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/transformers/trainer.py", line 3712, in training_step
[rank5]:     inputs = self._prepare_inputs(inputs)
[rank5]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/trl/extras/profiling.py", line 87, in wrapper
[rank5]:     return func(self, *args, **kwargs)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 647, in _prepare_inputs
[rank5]:     inputs = self._generate_and_score_completions(inputs)
[rank5]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 861, in _generate_and_score_completions
[rank5]:     self._metrics[mode]["reward_std"].append(std_grouped_rewards.mean().item())
[rank5]:                                              ^^^^^^^^^^^^^^^^^^^
[rank5]: UnboundLocalError: cannot access local variable 'std_grouped_rewards' where it is not associated with a value
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/never_peft.py", line 900, in <module>
[rank0]:     trainer.train()
[rank0]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/transformers/trainer.py", line 3712, in training_step
[rank0]:     inputs = self._prepare_inputs(inputs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/trl/extras/profiling.py", line 87, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 647, in _prepare_inputs
[rank0]:     inputs = self._generate_and_score_completions(inputs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/unsloth_env/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 861, in _generate_and_score_completions
[rank0]:     self._metrics[mode]["reward_std"].append(std_grouped_rewards.mean().item())
[rank0]:                                              ^^^^^^^^^^^^^^^^^^^
[rank0]: UnboundLocalError: cannot access local variable 'std_grouped_rewards' where it is not associated with a value
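
The traceback shows the logging line reading std_grouped_rewards when it has no value. One plausible cause, given that the error only appears with scale_rewards=False, is that the variable is assigned only on the scaling branch while the logging runs on both. The sketch below is a reduced, hypothetical reproduction of that pattern in plain Python; the function names and the 1e-4 epsilon are illustrative assumptions, not the TRL source.

```python
def centre_and_log(rewards, scale_rewards):
    """Hypothetical reduction of the failing pattern (not the TRL code)."""
    mean = sum(rewards) / len(rewards)
    advantages = [r - mean for r in rewards]
    if scale_rewards:
        # The standard deviation is only bound on this branch.
        variance = sum((r - mean) ** 2 for r in rewards) / (len(rewards) - 1)
        std = variance ** 0.5
        advantages = [a / (std + 1e-4) for a in advantages]
    # The "logging" step reads std on both branches, so it is unbound
    # whenever scale_rewards is False.
    return advantages, std


def raises_unbound(scale_rewards):
    """Return True if the call fails with UnboundLocalError."""
    try:
        centre_and_log([1.0, 0.0, 1.0, 0.0], scale_rewards)
    except UnboundLocalError:
        return True
    return False
```

Calling raises_unbound(False) triggers the same class of error as in the traceback, while raises_unbound(True) runs cleanly.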

qgallouedec · 2025-03-22T19:57:36Z

Thanks, it should be fixed now
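
As a rough illustration of what the option controls, the sketch below computes group-relative advantages and divides by the per-group standard deviation only when scale_rewards is True. The standard deviation is computed unconditionally so that it can always be logged, which is one way to avoid the error reported above. The function name, the flat reward layout, and the eps default are assumptions made for this example, and it uses the standard library rather than the trainer's tensor code.

```python
import statistics


def group_advantages(rewards, num_generations, scale_rewards=True, eps=1e-4):
    """Sketch: centre rewards per group, optionally scale by the group std.

    rewards is a flat list in which each consecutive run of
    num_generations entries belongs to the same prompt (an assumption
    of this example).
    """
    if len(rewards) % num_generations != 0:
        raise ValueError("rewards must split evenly into groups")

    advantages, group_stds = [], []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = statistics.fmean(group)
        # Always computed, so the logged value exists on both branches.
        std = statistics.stdev(group) if len(group) > 1 else 0.0
        group_stds.append(std)
        for r in group:
            adv = r - mean
            if scale_rewards:
                adv /= std + eps
            advantages.append(adv)

    # Mean of the per-group stds, available for logging either way.
    reward_std = statistics.fmean(group_stds)
    return advantages, reward_std
```

With scale_rewards=False the advantages are just the centred rewards, which is the unscaled variant the PR title refers to; the returned reward_std is the same for both settings.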

add option not to scale the rewards

d7118bc

qgallouedec requested review from kashif, edbeeching, lewtun and shirinyamani March 22, 2025 19:11

kashif approved these changes Mar 22, 2025

fix for logging

fe6686a

qgallouedec merged commit 9b38b0b into main Mar 22, 2025
9 of 14 checks passed

qgallouedec deleted the dr-grpo branch March 22, 2025 20:47

Peng-YM mentioned this pull request Mar 25, 2025

[Feature Request]Support Dr. GRPO for Unbiased Optimization in RL Training volcengine/verl#742

Closed

lkevinzc mentioned this pull request Mar 25, 2025

some details and reproduction sail-sg/understand-r1-zero#6

Closed

kashif pushed a commit to kashif/trl that referenced this pull request Mar 28, 2025

⚖️ Add option not to scale rewards (Dr. GRPO) (huggingface#3135)

33f00a2

yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025

⚖️ Add option not to scale rewards (Dr. GRPO) (huggingface#3135)

2b61a20
