Xiangyu Zhao*, Peiyuan Zhang*, Kexian Tang*, Xiaorong Zhu*,
Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia,
Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan
If you find our work helpful, please consider giving us a ⭐ or citation 😊
- [2025/08/29] We have updated the results of Gemini-2.5-Flash-Image. The model now takes the top spot, surpassing GPT-Image-1.
- [2025/08/20] We have updated the results of Qwen-Image-Edit.
- [2025/08/07] We have updated the results of FLUX.1-Kontext-dev, thanks to @ErfeiCui.
- [2025/07/08] We’ve launched a HuggingFace Space that hosts every image generated during our model evaluations. Dive into the gallery and explore the visual diversity of RISEBench, just click and enjoy! Visit the gallery →
- [2025/06/15] RISEBench has been officially evaluated by BAGEL, achieving the third-highest overall performance (Thinking Mode), with results comparable to Gemini-2.0. Check OfficialRepo for details about the evaluation.
- [2025/05/27] We have released two versions of our benchmark suite: the full version, named RISEBench-360, and a smaller version, named RISEBench-64. The RISEBench-64 version is also available in our repository as an initial offering. Feel free to choose the version that best suits your needs! 😃
- [2025/05/27] Our paper has been updated! Please refer to Arxiv for comprehensive details.
- [2025/05/19] The RISEBench final version (scaled up to 360 samples) has been released! Please refer to HF-RISEBench for the full RISEBench data.
- [2025/04/08] RISEBench is Scaling Up! The final complete benchmark will be released soon. Stay tuned for updates!
- [2025/04/08] The benchmark and evaluation code have been released! Have fun 😃 .
- [2025/04/05] Our paper is released.
- [2025/04/05] The benchmark and evaluation code will be released soon.
In this work, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning.
To comprehensively assess model performance across diverse task types, we define three key evaluation dimensions: Instruction Reasoning, Appearance Consistency, and Visual Plausibility.
In addition, we design a robust LMM-as-a-Judge evaluation pipeline and leverage a state-of-the-art LMM (GPT-4o) to generate automated assessments. Our approach offers a scalable and reproducible alternative to human evaluation while maintaining a high degree of alignment with human judgment.
As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems.
To evaluate the performance of representative visual editing approaches, we selected a diverse set of models spanning multiple architectures and generation paradigms. Specifically, FLUX.1-Canny serves as a representative diffusion-based editing model, while EMU2 exemplifies the auto-regressive generation paradigm. We also include several proprietary models, such as GPT-4o-Image, Gemini-2.0-Flash-Experimental, and Gemini-2.0-Flash-Preview. The outputs of proprietary models are obtained via the official APIs.
Model | Temporal (%) | Causal (%) | Spatial (%) | Logical (%) | Overall (%) |
---|---|---|---|---|---|
🏆 Gemini-2.5-Flash-Image | 25.9 | 47.8 | 37.0 | 18.8 | 32.8 |
🥈 GPT-Image-1 | 34.1 | 32.2 | 37.0 | 10.6 | 28.9 |
🥉 Gemini-2.0-Flash-exp | 8.2 | 15.5 | 23.0 | 4.7 | 13.3 |
BAGEL (w/ CoT) | 5.9 | 17.8 | 21.0 | 1.2 | 11.9 |
Gemini-2.0-Flash-pre | 10.6 | 13.3 | 11.0 | 2.3 | 9.4 |
Qwen-Image-Edit | 4.7 | 10.0 | 17.0 | 2.4 | 8.9 |
BAGEL | 2.4 | 5.6 | 14.0 | 1.2 | 6.1 |
FLUX.1-Kontext-Dev | 2.3 | 5.5 | 13.0 | 1.2 | 5.8 |
Ovis-U1 | 1.2 | 3.3 | 4.0 | 2.4 | 2.8 |
Step1X-Edit | 0.0 | 2.2 | 2.0 | 3.5 | 1.9 |
OmniGen | 1.2 | 1.0 | 0.0 | 1.2 | 0.8 |
EMU2 | 1.2 | 1.1 | 0.0 | 0.0 | 0.5 |
HiDream-Edit | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
FLUX.1-Canny | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Model | 🧠 Instruction Reasoning | 🪞 Appearance Consistency | 👁️ Visual Plausibility |
---|---|---|---|
🏆 Gemini-2.5-Flash-Image | 61.2 | 86.0 | 91.3 |
🥈 GPT-Image-1 | 62.8 | 80.2 | 94.9 |
🥉 Gemini-2.0-Flash-exp | 48.9 | 68.2 | 82.7 |
BAGEL (w/ CoT) | 45.9 | 73.8 | 80.1 |
Gemini-2.0-Flash-pre | 49.9 | 68.4 | 84.9 |
Qwen-Image-Edit | 37.2 | 66.4 | 86.9 |
BAGEL | 36.5 | 53.5 | 73.0 |
FLUX.1-Kontext-Dev | 26.0 | 71.6 | 85.2 |
Ovis-U1 | 33.9 | 52.7 | 72.9 |
HiDream-Edit | 30.3 | 12.6 | 74.9 |
Step1X-Edit | 25.1 | 41.5 | 73.5 |
EMU2 | 22.6 | 38.2 | 78.3 |
OmniGen | 22.0 | 32.6 | 55.3 |
FLUX.1-Canny | 20.2 | 13.1 | 77.5 |
The input images for the four categories are located in the `data` directory. Each sample in the dataset contains an `instruction` and an associated `image`. You can use these inputs to generate the corresponding output image.
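The generation step can be sketched as a simple loop over the dataset. This is a minimal illustration, not the repo's own code: the field names (`category`, `index_name`, `instruction`, `image`) and the `generate_fn` callback are assumptions about the schema, so adapt them to the actual JSON layout.

```python
import json
import os

def save_outputs(data_json, model_name, generate_fn, out_root="outputs"):
    """Run a model over each benchmark sample and save outputs in the
    directory layout RISEBench expects.

    NOTE: the sample field names below are assumed, not the actual schema.
    `generate_fn(instruction, image)` stands in for your model's editing
    call and should return an object with a `.save(path)` method
    (e.g., a PIL Image).
    """
    with open(data_json) as f:
        samples = json.load(f)
    for sample in samples:
        category = sample["category"]        # e.g., "temporal_reasoning"
        index_name = sample["index_name"]    # e.g., "temporal_reasoning_1"
        out_dir = os.path.join(out_root, model_name, "images", category)
        os.makedirs(out_dir, exist_ok=True)
        edited = generate_fn(sample["instruction"], sample["image"])
        edited.save(os.path.join(out_dir, f"{index_name}.png"))
```

Plugging in your model's inference call for `generate_fn` yields outputs already organized for the evaluation script described below.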
Output File Structure: Generated outputs should be saved in the following directory structure:

`outputs/{MODEL_NAME}/images/{CATEGORY}/{INDEX_NAME}.{FORMAT}`

- `{MODEL_NAME}`: The name of the model you are using (e.g., `gpt-4o`).
- `{CATEGORY}`: The category of the sample (e.g., `temporal_reasoning`).
- `{INDEX_NAME}`: The index of the sample in the dataset.
- `{FORMAT}`: The file format of the output image (supported formats: `.png`, `.jpg`, or `.jpeg`).
For example: `outputs/gpt-4o-native/images/temporal_reasoning/temporal_reasoning_1.png`
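Before running the evaluation, it can help to verify that the output directory is complete. The sketch below counts saved images per category; the category directory names are assumptions based on the example above, so adjust them if the repo uses different ones.

```python
import glob
import os

# Assumed category directory names (only "temporal_reasoning" is confirmed
# by the example path above; the other three are guesses).
CATEGORIES = ["temporal_reasoning", "causal_reasoning",
              "spatial_reasoning", "logical_reasoning"]

def count_outputs(model_name, out_root="outputs"):
    """Count generated images per category, matching the supported
    output formats (.png, .jpg, .jpeg)."""
    counts = {}
    for cat in CATEGORIES:
        pattern = os.path.join(out_root, model_name, "images", cat, "*")
        files = [p for p in glob.glob(pattern)
                 if p.lower().endswith((".png", ".jpg", ".jpeg"))]
        counts[cat] = len(files)
    return counts
```

A category with a lower count than expected usually means some generations failed and should be rerun before evaluation.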
Once all outputs are generated and saved in the specified format, you can evaluate them using the `gpt_eval.py` script.

Open the `gpt_eval.py` file and update the following parameters with your OpenAI credentials:

- `api_key`: Your OpenAI API key.
- `api_base`: Your OpenAI API base URL (if applicable).
Execute the script using the following command:

`python gpt_eval.py --data data/data_total.json --output outputs/MODEL_NAME`
After running the script, three result files will be generated in the `outputs/{MODEL_NAME}` directory:

- `{MODEL_NAME}_judge.csv`: A CSV file containing the total evaluation scores.
- `{MODEL_NAME}_judge.xlsx`: An Excel file storing detailed responses from the GPT-4o judge model.
- `{MODEL_NAME}.pkl`: A serialized pickle file saving the raw responses from the judge model, which can be used to resume or extend evaluations later.
We exhibit some outputs of the five models in the appendix. For more details, please refer to our paper.
If you find RISEBench useful, please cite using this BibTeX:
@article{zhao2025envisioning,
title={Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing},
author={Zhao, Xiangyu and Zhang, Peiyuan and Tang, Kexian and Li, Hao and Zhang, Zicheng and Zhai, Guangtao and Yan, Junchi and Yang, Hua and Yang, Xue and Duan, Haodong},
journal={arXiv preprint arXiv:2504.02826},
year={2025}
}