[📖paper]   [🤗SophiaVL-R1-7B model]   [🤗Thinking Reward Model]
[🤗SophiaVL-R1-130k Dataset]   [🤗SophiaVL-R1-Thinking-156k Dataset]
We introduce SophiaVL-R1 to explore the R1 paradigm with thinking-level rewards in vision-language reasoning, motivated by the phenomenon of "wrong thinking, correct answer".
To achieve this, we train a Thinking Reward Model on our curated SophiaVL-R1-Thinking-156k dataset to yield a reward that evaluates the thinking process along multiple dimensions.
In addition, we introduce the Trust-GRPO algorithm, which assigns a trustworthiness weight to thinking rewards based on their reliability. This guides the model toward favorable reasoning policies in a trustworthy manner, without extra computational overhead for uncertainty estimation.
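For intuition, here is a minimal sketch of how a trust-weighted thinking reward could be folded into GRPO-style group-normalized advantages. The weighting scheme and variable names are illustrative assumptions, not the exact Trust-GRPO formulation from the paper:

import numpy as np

def combined_reward(answer_rewards, thinking_rewards, trust_weight):
    """Blend rule-based answer rewards with model-based thinking rewards.

    answer_rewards:   0/1 correctness rewards for a group of rollouts
    thinking_rewards: Thinking Reward Model scores in [0, 1]
    trust_weight:     scalar in [0, 1]; meant to be lower when the thinking
                      reward looks unreliable (an illustrative assumption --
                      see the paper for the actual estimator)
    """
    return (np.asarray(answer_rewards, dtype=float)
            + trust_weight * np.asarray(thinking_rewards, dtype=float))

# GRPO-style advantages: normalize rewards within a group of rollouts.
rewards = combined_reward([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.7], trust_weight=0.5)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)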
Our SophiaVL-R1-7B model achieves strong performance across multiple benchmarks (e.g., 61.3% on MMMU) and can be trained efficiently on 8 A100 GPUs in just 1,500 steps using our SophiaVL-R1-130k dataset.
- Python 3.9+
- transformers>=4.51.0
- flash-attn>=2.4.3
- vllm>=0.8.3
Start with the following commands:
git clone https://github.com/kxfan2002/SophiaVL-R1.git
cd SophiaVL-R1
conda create -n sophiavl python=3.10
conda activate sophiavl
pip install -r requirements.txt
We recommend using huggingface-cli to download the model:
# install huggingface-cli
pip install -U huggingface_hub
huggingface-cli login
# download the trained thinking reward model
huggingface-cli download bunny127/SophiaVL-R1-Thinking-Reward-Model-3B --local-dir <local_dir>
We provide the SophiaVL-R1-130k dataset for RL training and the SophiaVL-R1-Thinking-156k dataset for training the Thinking Reward Model.
Download the dataset:
# install huggingface-cli
pip install -U huggingface_hub
huggingface-cli login
huggingface-cli download bunny127/SophiaVL-R1-130k --repo-type dataset --local-dir <local_dir>
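To inspect the data before training, you can load it with the Hugging Face datasets library. A quick sketch (the "train" split name is an assumption; check the downloaded files for the exact schema):

from datasets import load_dataset

# Load the RL training data; point this at your <local_dir> if you
# downloaded the files locally. The "train" split is an assumption.
ds = load_dataset("bunny127/SophiaVL-R1-130k", split="train")
print(ds)       # column names and number of rows
print(ds[0])    # one sample (prompt / answer / image fields)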
Our SophiaVL-R1-130k dataset encompasses a wide range of reasoning data.
We support both text-only and image-text datasets, in Parquet and JSON file formats. To train on your own datasets, please register your dataset in verl/data/dataset_info.json in the following format:
"myDataset":{
"file_path":"/path/to/your/dataset",
"image_base_path":"/your/image/base/path",
"columns":{
"column_reponses_to_prompt":"prompt",
"column_reponses_to_answer":"answer",
"column_reponses_to_images":"images"
}
},
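After registering, a quick way to confirm that the entry parses and points at real files (a small illustrative helper, not part of the repo):

import json, os

# Hypothetical sanity check for your new dataset_info.json entry.
with open("verl/data/dataset_info.json") as f:
    info = json.load(f)

entry = info["myDataset"]
assert os.path.exists(entry["file_path"]), "dataset file not found"
print("column mapping:", entry["columns"])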
To begin training, you first need to launch the Thinking Reward Model server using vLLM:
bash scripts/train_scripts/thinking_reward_model.sh
This script launches our trained Thinking Reward Model and exposes it through an OpenAI-compatible API.
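Once the server is up, you can sanity-check it with any OpenAI-compatible client. A minimal sketch (the base URL, port, and prompt format are assumptions; use the values configured in your launch script and the environment variables below):

from openai import OpenAI

# Point the client at the local vLLM server started above.
# Base URL and model name are assumptions; match your script's settings.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="bunny127/SophiaVL-R1-Thinking-Reward-Model-3B",
    messages=[{"role": "user", "content": "Score this thinking process: ..."}],
)
print(resp.choices[0].message.content)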
If you want to train your own reward model, you may refer to this issue: #7
Next, set the following environment variables in scripts/train_scripts/run_train.sh so that the training script can access the reward model:
- OPENAI_API_KEY: key for the reward model API
- OPENAI_API_URL: URL for the reward model API
- REWARD_MODEL: model name of the reward model
Modify your training parameters in scripts/train_scripts/fullsets.yaml.
Finally, start the training process with:
bash scripts/train_scripts/run_train.sh
After training, the checkpoints saved by EasyR1 need to be merged before inference. The following script converts the saved checkpoints to the Hugging Face format:
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
We provide a simple inference script for you to test the model. The full script is here. Give it a try with your own data!
# download the trained reasoning model for direct inference
huggingface-cli download bunny127/SophiaVL-R1-7B --local-dir <local_dir>
# Modify the fields below to match your test data
MODEL_PATH = "bunny127/SophiaVL-R1-7B" # or your local path
image_path = "/path/to/dataset/Math/CLEVR-Math/images/CLEVR_train_036427.png" # your local image path
prompt = "Subtract 0 cyan cubes. How many objects are left?"
question_type = "numerical" # numerical, multiple_choice, free-form, OCR
We use VLMEvalKit to evaluate SophiaVL-R1. To register our model in VLMEvalKit, add a model description in vlmeval/config.py:
"trained_model": partial(
Qwen2VLChat,
model_path="/path/to/model",
min_pixels=1280 * 28 * 28,
max_pixels=16384 * 28 * 28,
use_custom_prompt=False,
),
We use the following system prompt for the evaluation of all models:
system_prompt="You FIRST think about the reasoning process as an internal monologue and then provide the final answer. Please think about this question as if you were a human pondering deeply. Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions. It's encouraged to include self-reflection or verification in the reasoning process.The reasoning process MUST BE enclosed within <think> </think> tagsdd. The final answer MUST BE enclosed within <answer> </answer> tags, for example <think>your_thinking_process</think><answer>your_final_answer</answer>. If you use formula, please use LaTeX format.",
SophiaVL-R1-7B demonstrates strong performance across multiple MLLM benchmarks, including both mathematical reasoning and general capability tasks.
This figure shows the accuracy reward curves during training. It is evident that SophiaVL-R1, trained with thinking-level rewards and Trust-GRPO, achieves significantly better training performance.
We sincerely appreciate the contributions of the open-source community. This work is built upon EasyR1.
If you find our work helpful for your research, please consider citing it:
@article{fan2025sophiavl,
  title={SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward},
  author={Fan, Kaixuan and Feng, Kaituo and Lyu, Haoming and Zhou, Dongzhan and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2505.17018},
  year={2025}
}