☕ Overlong-filtering for GRPO #3248

shirinyamani · 2025-04-07T03:00:00Z

What does this PR do?

Changes:

Simplified completion_mask so that nothing changes in the loss computation, which keeps it compatible with the Liger loss.
Intended to be extended later to also remove the truncated (masked) completions from the advantage computation.
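For readers new to overlong filtering, here is a minimal sketch of the masking idea. It is not the TRL implementation: it uses plain Python lists instead of tensors, the helper name `build_completion_mask` is hypothetical, and it assumes that a completion counts as truncated when it never emits an EOS token and that padding reuses the EOS id.

```python
# Sketch only: plain-Python stand-in for the tensor logic; names are hypothetical.
EOS = 0  # assumed end-of-sequence token id (also used as padding here)


def build_completion_mask(completions, eos_id=EOS, mask_truncated=True):
    """Return one 0/1 mask per completion.

    Tokens up to and including the first EOS are kept (1); padding after it is
    dropped (0). A completion with no EOS reached the length limit, so it is
    treated as truncated and, if mask_truncated is set, masked out entirely.
    """
    masks = []
    for tokens in completions:
        if eos_id in tokens:
            end = tokens.index(eos_id) + 1
            masks.append([1] * end + [0] * (len(tokens) - end))
        elif mask_truncated:
            masks.append([0] * len(tokens))
        else:
            masks.append([1] * len(tokens))
    return masks


batch = [
    [5, 7, EOS, EOS],  # finished early, then padded
    [5, 7, 9, 4],      # never emitted EOS: truncated
]
masks = build_completion_mask(batch)
```

Because only the mask changes, the loss code that consumes it can stay as it is, which seems to be the point of the simplification described above.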

A simple script to take it for a spin:

from datasets import load_dataset
from trl.trainer.grpo_trainer import GRPOTrainer
from trl.trainer.grpo_config import GRPOConfig


dataset = load_dataset("trl-lib/tldr", split="train[:200]")


def reward_func(completions, **kwargs):
    return [len(set(c)) for c in completions]

args = GRPOConfig(
    output_dir="mask_truncated_ture_with_max_completion_length",
    use_vllm=False,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=50,
    mask_truncated_completions=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    num_generations=4,
    max_completion_length=50,
)


trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=args,
    reward_funcs=reward_func,
    train_dataset=dataset,
)


trainer.train()

Then run:

accelerate launch 3248.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

HuggingFaceDocBuilderDev · 2025-04-07T03:04:28Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

docs/source/grpo_trainer.md

trl/trainer/grpo_config.py

shirinyamani

I don't think it's logged

🤔 Hmm, any comment on how to fix it?

trl/trainer/grpo_trainer.py

qgallouedec · 2025-04-07T21:46:18Z

The advantage calculation always takes truncated samples into account, doesn't it?
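To make the question concrete, here is a rough sketch of group-relative advantages computed over every completion in a group, truncated or not, with the truncation flag only affecting the loss weight. The function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration, not the trainer's code.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of generations (illustrative only)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, 3.0, 2.0, 6.0]
truncated = [False, False, False, True]

# The truncated sample still shifts the group mean and std for everyone else.
advantages = group_advantages(rewards)

# Masking only removes the truncated sample's own contribution to the loss.
loss_weights = [0.0 if t else 1.0 for t in truncated]
```

In this sketch, dropping the truncated reward before normalizing would give the remaining samples different advantages, which is the distinction the comment raises.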

trl/trainer/grpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

trl/trainer/grpo_config.py

qgallouedec · 2025-04-08T14:37:06Z

There are still some minor things to do:

revert changes in docs/source/grpo_trainer.md and examples/scripts/sft_video_llm.py
fix CI
update PR name

shirinyamani · 2025-04-08T15:00:53Z

That is very surprising, because I ran
git checkout -- examples/scripts/sft_video_llm.py

qgallouedec · 2025-04-08T16:22:56Z

tests/test_grpo_trainer.py

+                learning_rate=0.1,
+                per_device_train_batch_size=3,
+                num_generations=3,
+                max_completion_length=8,


I'm curious to see if the CI will pass. What I foresee is that, given such a small max_completion_length, all completions are truncated, so they are all filtered and the model never updates.

Yes, actually, the completions with the dataset in use are too short (the initial ones are only 1 token).

Knew it!

FAILED tests/test_grpo_trainer.py::GRPOTrainerTester::test_training_with_mask_truncated_completions - AssertionError: True is not false : Parameter model.embed_tokens.weight has not changed.

451bec9 should fix it
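The failure mode discussed above can be sketched in a few lines. This is an illustration under stated assumptions, not the trainer's loss: `masked_mean_loss` is a hypothetical helper, and guarding the denominator with `max(..., 1)` is only one possible way to avoid dividing by zero.

```python
def masked_mean_loss(per_token_loss, mask):
    """Average the per-token loss over unmasked tokens only (illustrative)."""
    total = sum(
        loss * keep
        for loss_row, mask_row in zip(per_token_loss, mask)
        for loss, keep in zip(loss_row, mask_row)
    )
    kept = sum(sum(row) for row in mask)
    # Guard against an all-zero mask, where every completion was filtered out.
    return total / max(kept, 1)


per_token_loss = [[1.0, 2.0], [3.0, 4.0]]

partly_masked = masked_mean_loss(per_token_loss, [[1, 1], [0, 0]])
fully_masked = masked_mean_loss(per_token_loss, [[0, 0], [0, 0]])
```

When every completion is masked, the loss is a constant zero, so no parameter receives a gradient. That matches the "has not changed" assertion in the failing test.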

tests/test_grpo_trainer.py

into overlong-filtering2

qgallouedec · 2025-04-08T18:35:07Z

Feel free to merge, CI failing is not related to this PR

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

shirinyamani and others added 9 commits April 6, 2025 19:23

overlong filtering

49ee0d6

help updated

583fa90

Update trl/trainer/grpo_config.py

f0b53dd

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

Update trl/trainer/grpo_config.py

2ecbb43

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

Update trl/trainer/grpo_config.py

425ff61

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

samples--> completions + %log removed

b734856

epsilon_high+masked_truc_comz added to logged metric

15431b3

link to paper added

8eba6ed

simplified completion mask added

93dfee7

shirinyamani added 2 commits April 7, 2025 03:22

avoid div by zero

1c68077

different test setups

e9a4cfd

qgallouedec mentioned this pull request Apr 7, 2025

Add support to new DAPO method #3130

Closed

final simplified added

aea335b

shirinyamani requested review from edbeeching, lewtun and qgallouedec April 7, 2025 21:39