This repo implements Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) from scratch in PyTorch, without relying on off-the-shelf libraries like TRL or VERL.
- GRPO paper: arXiv:2402.03300
- DPO paper: arXiv:2305.18290
The goal is to open the black box: the implementations spell out the training details (masking, KL penalties, scheduling, and evaluation) so you can see exactly how these algorithms work in practice.
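As a taste of those details, here is a minimal sketch of GRPO's group-relative advantages and the masked per-token KL penalty against a frozen reference model (function and tensor names here are illustrative, not the repo's actual code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's reward within its group.

    rewards: (num_prompts, group_size) scalar rewards for the sampled completions.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def masked_kl_penalty(policy_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      completion_mask: torch.Tensor) -> torch.Tensor:
    """Per-token KL estimate vs. the reference model, averaged over completion tokens only.

    All tensors are (batch, seq_len); completion_mask is 1 on generated tokens, 0 on prompt/padding.
    Uses the k3 estimator exp(r) - r - 1 with r = log pi_ref - log pi_theta, as in the GRPO paper.
    """
    log_ratio = ref_logprobs - policy_logprobs
    kl = torch.exp(log_ratio) - log_ratio - 1.0
    return (kl * completion_mask).sum(dim=1) / completion_mask.sum(dim=1).clamp(min=1)
```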
Results:
- GRPO on Llama-3.2-1B-Instruct (GSM8K): ~10% → ~23% accuracy in 1 epoch.
- DPO on Llama-3.2-1B using Tiny-Safe-Pair (safe-pair-data): ~50% → ~60% preference accuracy in 3 epochs.
Both evaluation pipelines are included.
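The DPO numbers above are measured as preference accuracy: the fraction of pairs where the policy's implicit reward for the chosen response beats the rejected one. A minimal sketch of the DPO loss and that metric, with illustrative names and summed sequence log-probs as inputs:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp: torch.Tensor, policy_rejected_lp: torch.Tensor,
             ref_chosen_lp: torch.Tensor, ref_rejected_lp: torch.Tensor,
             beta: float = 0.1):
    """DPO loss and preference accuracy from (batch,) sequence log-probabilities."""
    # Policy-vs-reference log-ratios for each response in the pair.
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    # Implicit reward margin; the DPO loss is a logistic loss on this margin.
    margin = beta * (chosen_ratio - rejected_ratio)
    loss = -F.logsigmoid(margin).mean()
    # Preference accuracy: how often the chosen response wins under the implicit reward.
    accuracy = (margin > 0).float().mean()
    return loss, accuracy
```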
The scripts default to multi-GPU training with PyTorch DDP and can be adapted to a single GPU by adjusting the launch command and disabling distributed initialization. Evaluation runs on a single GPU.
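A rough sketch of that kind of setup, assuming the standard torchrun environment variables; the repo's actual launcher code may differ:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_model(model: torch.nn.Module) -> torch.nn.Module:
    """Wrap the model in DDP when launched via torchrun; otherwise use a single device."""
    if int(os.environ.get("WORLD_SIZE", "1")) > 1:
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        return DDP(model.to(local_rank), device_ids=[local_rank])
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return model.to(device)
```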
Training:

```bash
torchrun --standalone --nproc_per_node=8 dpo/grpo_train_from_scratch.py
```

Evaluation:

```bash
python dpo/grpo_evaluation.py
```
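For context on how the GSM8K accuracy is typically scored: the final number in each generation is compared against the gold answer that follows the dataset's `####` marker. A minimal sketch; the repo's actual extraction logic may differ:

```python
import re
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    """Return the last number in the text, with commas and a trailing period stripped."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "").rstrip(".") if numbers else None

def gsm8k_accuracy(generations: list[str], references: list[str]) -> float:
    """Fraction of generations whose final number matches the gold answer after '####'."""
    correct = 0
    for gen, ref in zip(generations, references):
        gold = ref.split("####")[-1].strip().replace(",", "")
        if extract_answer(gen) == gold:
            correct += 1
    return correct / max(len(generations), 1)
```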
I’ve written up explanations of the two algorithms in the following blog posts: