[📖paper]   [🤗SophiaVL-R1-7B model]   [🤗Thinking Reward Model]
[🤗SophiaVL-R1-130k Dataset]   [🤗SophiaVL-R1-Thinking-156k Dataset]
We introduce SophiaVL-R1 to explore the R1 paradigm with thinking-level rewards in vision-language reasoning, motivated by the phenomenon of "wrong thinking, correct answer".
To achieve this, we train a Thinking Reward Model on our curated SophiaVL-R1-Thinking-156k dataset to yield a reward that evaluates the thinking process along multiple dimensions.
In addition, we introduce the Trust-GRPO algorithm, which assigns a trustworthiness weight to thinking rewards based on their reliability. This guides the model toward favorable reasoning policies in a trustworthy manner, without extra computational overhead for uncertainty estimation.
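For intuition, here is a minimal sketch of how a trust-weighted thinking reward could be folded into GRPO-style group-normalized advantages. The weighting scheme and variable names are illustrative assumptions, not the exact Trust-GRPO formulation from the paper:

import numpy as np

def combined_reward(answer_rewards, thinking_rewards, trust_weight):
    """Blend rule-based answer rewards with model-based thinking rewards.

    answer_rewards:   0/1 correctness rewards for a group of rollouts
    thinking_rewards: Thinking Reward Model scores in [0, 1]
    trust_weight:     scalar in [0, 1]; meant to be lower when the thinking
                      reward looks unreliable (an illustrative assumption --
                      see the paper for the actual estimator)
    """
    return (np.asarray(answer_rewards, dtype=float)
            + trust_weight * np.asarray(thinking_rewards, dtype=float))

# GRPO-style advantages: normalize rewards within a group of rollouts.
rewards = combined_reward([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.7], trust_weight=0.5)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)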
Our SophiaVL-R1-7B model achieves strong performance across multiple benchmarks (e.g., 61.3% on MMMU) and can be trained efficiently on 8 A100 GPUs in just 1,500 steps using our SophiaVL-R1-130k dataset.
- Python 3.9+
- transformers>=4.51.0
- flash-attn>=2.4.3
- vllm>=0.8.3
Start with the following commands:
git clone https://github.com/kxfan2002/SophiaVL-R1.git
cd SophiaVL-R1
conda create -n sophiavl python=3.10
conda activate sophiavl
pip install -r requirements.txt
We recommend using huggingface-cli to download the model:
# install huggingface-cli
pip install -U huggingface_hub
huggingface-cli login
# download the trained thinking reward model
huggingface-cli download bunny127/SophiaVL-R1-Thinking-Reward-Model-3B --local-dir <local_dir>
We provide the SophiaVL-R1-130k dataset for RL training and the SophiaVL-R1-Thinking-156k dataset for training the Thinking Reward Model.
Download the dataset:
# install huggingface-cli
pip install -U huggingface_hub
huggingface-cli login
huggingface-cli download bunny127/SophiaVL-R1-130k --repo-type dataset --local-dir <local_dir>
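To inspect the data before training, you can load it with the Hugging Face datasets library. A quick sketch (the "train" split name is an assumption; check the downloaded files for the exact schema):

from datasets import load_dataset

# Load the RL training data; point this at your <local_dir> if you
# downloaded the files locally. The "train" split is an assumption.
ds = load_dataset("bunny127/SophiaVL-R1-130k", split="train")
print(ds)       # column names and number of rows
print(ds[0])    # one sample (prompt / answer / image fields)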
Our SophiaVL-R1-130k dataset encompasses a wide range of reasoning data.
We support both text-only and image-text datasets, in Parquet and JSON file formats. To train on your own datasets, please register your dataset in verl/data/dataset_info.json in the following format:
"myDataset":{
"file_path":"/path/to/your/dataset",
"image_base_path":"/your/image/base/path",
"columns":{
"column_reponses_to_prompt":"prompt",
"column_reponses_to_answer":"answer",
"column_reponses_to_images":"images"
}
},
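After registering, a quick way to confirm that the entry parses and points at real files (a small illustrative helper, not part of the repo):

import json, os

# Hypothetical sanity check for your new dataset_info.json entry.
with open("verl/data/dataset_info.json") as f:
    info = json.load(f)

entry = info["myDataset"]
assert os.path.exists(entry["file_path"]), "dataset file not found"
print("column mapping:", entry["columns"])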
To begin training, you first need to launch the Thinking Reward Model server using vLLM:
bash scripts/train_scripts/thinking_reward_model.sh
This script launches our trained Thinking Reward Model and exposes it through an OpenAI-compatible API.
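Once the server is up, you can sanity-check it with any OpenAI-compatible client. A minimal sketch (the base URL, port, and prompt format are assumptions; use the values configured in your launch script and the environment variables below):

from openai import OpenAI

# Point the client at the local vLLM server started above.
# Base URL and model name are assumptions; match your script's settings.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="bunny127/SophiaVL-R1-Thinking-Reward-Model-3B",
    messages=[{"role": "user", "content": "Score this thinking process: ..."}],
)
print(resp.choices[0].message.content)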
If you want to train your own reward model, you may refer to this issue: #7
Next, set the following environment variables in scripts/train_scripts/run_train.sh so that the training script can access the reward model:
- OPENAI_API_KEY: key for the reward model API
- OPENAI_API_URL: URL for the reward model API
- REWARD_MODEL: model name of the reward model
Modify your training parameters in scripts/train_scripts/fullsets.yaml.
Finally, start the training process with:
bash scripts/train_scripts/run_train.sh
After training, the checkpoints saved by EasyR1 need to be merged before inference. The following script converts the saved checkpoints to the Hugging Face format:
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
We provide a simple inference script for you to test the model. The full script is here. Give it a try with your own data!
# download the trained reasoning model for direct inference
huggingface-cli download bunny127/SophiaVL-R1-7B --local-dir <local_dir>
# Modify the fields below to match your test data
MODEL_PATH = "bunny127/SophiaVL-R1-7B" # or your local path
image_path = "/path/to/dataset/Math/CLEVR-Math/images/CLEVR_train_036427.png" # your local image path
prompt = "Subtract 0 cyan cubes. How many objects are left?"
question_type = "numerical" # numerical, multiple_choice, free-form, OCR
We use VLMEvalKit to evaluate SophiaVL-R1. To register our model in VLMEvalKit, add a model description in vlmeval/config.py:
"trained_model": partial(
Qwen2VLChat,
model_path="/path/to/model",
min_pixels=1280 * 28 * 28,
max_pixels=16384 * 28 * 28,
use_custom_prompt=False,
),
We use the following system prompt for the evaluation of all models:
system_prompt="You FIRST think about the reasoning process as an internal monologue and then provide the final answer. Please think about this question as if you were a human pondering deeply. Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions. It's encouraged to include self-reflection or verification in the reasoning process.The reasoning process MUST BE enclosed within <think> </think> tagsdd. The final answer MUST BE enclosed within <answer> </answer> tags, for example <think>your_thinking_process</think><answer>your_final_answer</answer>. If you use formula, please use LaTeX format.",
SophiaVL-R1-7B demonstrates strong performance across multiple MLLM benchmarks, including both mathematical reasoning and general capability tasks.
This figure shows the accuracy reward curves during training. It is evident that SophiaVL-R1, trained with thinking-level rewards and Trust-GRPO, achieves significantly better training performance.
We sincerely appreciate the contributions of the open-source community. This work is built upon EasyR1.
If you find our work helpful for your research, please consider citing it:
@article{fan2025sophiavl,
  title={SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward},
  author={Fan, Kaixuan and Feng, Kaituo and Lyu, Haoming and Zhou, Dongzhan and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2505.17018},
  year={2025}
}