Models:
Vision-SR1-7B |
Vision-SR1-7B-Cold-Start
Datasets:
Vision-SR1-Cold-Start-9K |
Vision-SR1-47K
Training Curves:
Vision-SR1
LLM evaluation scripts and model generation outputs with LLM judgments are coming soon. Stay tuned!
Vision-SR1 is a self-rewarded RL training framework that decomposes a VLM's reasoning into visual perception reasoning and language reasoning. Inspired by prior works such as Vision-R1, Visionary-R1, and R1-VL, we leverage the VLM's own self-evolving reasoning ability to reward itself.
Because VLMs fuse the vision encoder with the LLM backbone only late in pretraining, they often rely primarily on language reasoning rather than visual perception. Standard RL training tends to recall prior language knowledge for accuracy gains while neglecting vision. External LLM-based perception rewards can help but introduce bias and heavy latency. We instead propose a self-reward framework, enabling the model to provide its own visual and reasoning feedback with no latency.
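To make the idea concrete, below is a rough conceptual sketch of how such a self-reward signal could be computed. The function names, data layout, and scoring rule are illustrative assumptions, not the repository's actual reward implementation (see the training scripts for that).

```python
# Conceptual sketch only -- names, data layout, and the scoring rule are
# illustrative assumptions, not the repo's actual reward implementation.

def self_reward(model, question, reference, rollout):
    """Score one rollout using the policy model itself as the reward source."""
    perception = rollout["perception"]  # the model's own description of the image
    answer = rollout["answer"]          # the model's final answer

    # Language/accuracy signal: does the final answer match the reference?
    accuracy = float(answer == reference)

    # Visual-perception signal: re-answer the question from the model's own
    # perception text alone (no image). If the model still gets it right,
    # the perception it wrote was faithful and self-contained.
    text_only_answer = model.generate(f"{perception}\n{question}")
    perception_score = float(text_only_answer == reference)

    # Combine both signals; the equal weighting here is an arbitrary choice.
    return accuracy + perception_score
```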
Besides the vision decomposition, we constructed two datasets: Vision-SR1-Cold-Start-9K for SFT and Vision-SR1-47K for RL.
Our training data is drawn from 23 sources and evenly split across three main areas: general visual understanding, science knowledge, and multimodal mathematical reasoning.
The codebase is adapted from verl and EasyR1.
- Python 3.9+
- transformers=4.49.0
git clone https://github.com/zli12321/Vision-SR1.git
cd Vision-SR1
conda create -n Vision-SR1 python=3.11
conda activate Vision-SR1
bash setup.sh
### Self-Reward Vision-SR1 GRPO Training
bash ./train_examples/2-7b_selfReward_train.sh
### Vision-SR1 Regular Training
bash ./train_examples/1-7b_visionR1_train.sh
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
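After merging, the checkpoint should load like any Hugging Face model. A minimal sketch, assuming the merger wrote a standard Hugging Face checkpoint directory (the path below is hypothetical; point it at your run's merged output):

```python
# Illustrative only: load a merged checkpoint with transformers.
# The checkpoint path is hypothetical -- use the directory produced by
# scripts/model_merger.py for your own run.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

ckpt = "checkpoints/easy_r1/exp_name/merged"  # hypothetical path
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
```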
- NOTE 1: We use Gemini-2.5-Flash as the judge. Different LLM judges will produce different evaluation results. For reference, we also compute rule-based evaluation accuracies, which are lower than the LLM-as-a-Judge scores on the math datasets.
- NOTE 2: We only use LLM-as-a-Judge for some of the datasets. For the multiple-choice datasets (mmmu-pro-vision, mmmu-pro-10-options, visnumbench, hallusionbench), we use string matching to save time and cost.
We provide all historical LLM generations for quick reference and easy access to the results:
python download_precomputed_evaluation_files.py
cd Evaluation
./get_eval_result.sh
bash ./validation_examples/2-seethink_format_eval.sh
cd Evaluation
# input_dir: the folder that contains the generated responses
# output_dir: the output folder for the LLM judge responses
python LLM_eval.py --input_dir ./Raw-Outputs/7B-Vision-SR1 --output_dir ./Raw-Outputs/LLM-Eval-out/7B-Vision-SR1
For LLM-as-a-judge, check Evaluation/utils/gemini_eval.py. You can implement the generate() function to use any LLM as the judge.
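For example, a drop-in judge backend that calls an OpenAI-style chat API might look like the sketch below; the exact signature expected by the evaluation scripts is the one in gemini_eval.py, so match that rather than this illustration (the model name and client setup here are assumptions).

```python
# Illustrative sketch of a custom judge backend: an OpenAI-style chat
# client in place of Gemini. Match the real signature expected by
# Evaluation/utils/gemini_eval.py; the model name below is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(prompt: str) -> str:
    """Send one judgment prompt to the LLM and return its raw text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model can serve as the judge
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,      # deterministic judgments
    )
    return response.choices[0].message.content
```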
# llm_eval_dir: the LLM eval output responses
# mcq_dir: the MCQ eval responses
python eval.py --llm_eval_dir ./Raw-Outputs/7B-Vision-SR1 --mcq_dir ./Raw-Outputs/LLM-Eval-out/7B-Vision-SR1
The supervised fine-tuning code is adapted from LLaMA-Factory for easy setup.
python download-sft-data.py
conda create -n SFT python=3.11
conda activate SFT
cd LLaMA-Factory-Cold-Start
pip install -e ".[torch,metrics]" --no-build-isolation
pip install --upgrade huggingface_hub
huggingface-cli login
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/Vision-SR1-Cold-Start.yaml
If you still encounter errors after following the setup, simply clone the original LLaMA-Factory repo and follow their setup. Download the dataset and place it into the LLaMA-Factory data folder, then place the Vision-SR1-Cold-Start.yaml file into the LLaMA-Factory SFT training folder.
Method | Bits | 3B | 7B |
---|---|---|---|
GRPO Full Fine-Tuning | AMP | 4 or 8*40GB | 4 or 8*80GB |

\* estimated
Note: use `worker.actor.fsdp.torch_dtype=bf16` and `worker.actor.optim.strategy=adamw_bf16` to enable bf16 training and reduce memory usage.
Please refer to the example datasets to prepare your own dataset.
- Text dataset: https://huggingface.co/datasets/hiyouga/math12k
- Image-text dataset: https://huggingface.co/datasets/hiyouga/geometry3k
- Multi-image-text dataset: https://huggingface.co/datasets/hiyouga/journeybench-multi-image-vqa
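If it helps, you can inspect one of the example datasets with the `datasets` library to see the fields your own data should mirror (the split name below is an assumption):

```python
# Illustrative only: peek at an example dataset to see its schema.
from datasets import load_dataset

example = load_dataset("hiyouga/geometry3k", split="train")  # split name assumed
print(example)     # number of rows and column names
print(example[0])  # one record, e.g. problem text, image(s), and answer
```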
If you find our work helpful, please cite:
@misc{li2025selfrewardingvisionlanguagemodelreasoning,
title={Self-Rewarding Vision-Language Model via Reasoning Decomposition},
author={Zongxia Li and Wenhao Yu and Chengsong Huang and Rui Liu and Zhenwen Liang and Fuxiao Liu and Jingxi Che and Dian Yu and Jordan Boyd-Graber and Haitao Mi and Dong Yu},
year={2025},
eprint={2508.19652},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.19652},
}
We also recommend citing the source codebase:
@misc{zheng2025easyr1,
title = {EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework},
author = {Yaowei Zheng and Junting Lu and Shenzhi Wang and Zhangchi Feng and Dongdong Kuang and Yuwen Xiong},
howpublished = {\url{https://github.com/hiyouga/EasyR1}},
year = {2025}
}