Xiangyu Zhao*, Peiyuan Zhang*, Kexian Tang*, Xiaorong Zhu*,
Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia,
Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan
If you find our work helpful, please consider giving us a ⭐ or citation 😊
- [2025/08/29] We have updated the results of Gemini-2.5-Flash-Image. The model now takes the top spot, surpassing GPT-Image-1.
- [2025/08/20] We have updated the results of Qwen-Image-Edit.
- [2025/08/07] We have updated the results of FLUX.1-Kontext-dev, thanks to @ErfeiCui.
- [2025/07/08] We’ve launched a HuggingFace Space that hosts every image generated during our model evaluations. Dive into the gallery and explore the visual diversity of RISEBench, just click and enjoy! Visit the gallery →
- [2025/06/15] RISEBench has been officially evaluated by BAGEL, achieving the third-highest overall performance (Thinking Mode), with results comparable to Gemini-2.0. Check OfficialRepo for details about the evaluation.
- [2025/05/27] We have released two versions of our benchmark suite: the full version, named RISEBench-360, and a smaller version, named RISEBench-64. The RISEBench-64 version is also available in our repository as an initial offering. Feel free to choose the version that best suits your needs! 😃
- [2025/05/27] Our paper has been updated! Please refer to Arxiv for comprehensive details.
- [2025/05/19] The RISEBench final version (scaled up to 360 samples) has been released! Please refer to HF-RISEBench for the full RISEBench data.
- [2025/04/08] RISEBench is Scaling Up! The final complete benchmark will be released soon. Stay tuned for updates!
- [2025/04/08] The benchmark and evaluation code have been released! Have fun 😃 .
- [2025/04/05] Our paper is released.
- [2025/04/05] The benchmark and evaluation code will be released soon.
In this work, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning.
To comprehensively assess model performance across diverse task types, we define three key evaluation dimensions: Instruction Reasoning, Appearance Consistency, and Visual Plausibility.
In addition, we design a robust LMM-as-a-Judge evaluation pipeline and leverage a state-of-the-art LMM (GPT-4o) to generate automated assessments. Our approach offers a scalable and reproducible alternative to human evaluation while maintaining a high degree of alignment with human judgment.
As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems.
To evaluate the performance of representative visual editing approaches, we selected a diverse set of models spanning multiple architectures and generation paradigms. Specifically, FLUX.1-Canny serves as a representative diffusion-based editing model, while EMU2 exemplifies the auto-regressive generation paradigm. We also include several proprietary models, such as GPT-4o-Image, Gemini-2.0-Flash-Experimental, and Gemini-2.0-Flash-Preview. The outputs of proprietary models are obtained via the official APIs.
Model | Temporal (%) | Causal (%) | Spatial (%) | Logical (%) | Overall (%) |
---|---|---|---|---|---|
🏆 Gemini-2.5-Flash-Image | 25.9 | 47.8 | 37.0 | 18.8 | 32.8 |
🥈 GPT-Image-1 | 34.1 | 32.2 | 37.0 | 10.6 | 28.9 |
🥉 Gemini-2.0-Flash-exp | 8.2 | 15.5 | 23.0 | 4.7 | 13.3 |
BAGEL (w/ CoT) | 5.9 | 17.8 | 21.0 | 1.2 | 11.9 |
Gemini-2.0-Flash-pre | 10.6 | 13.3 | 11.0 | 2.3 | 9.4 |
Qwen-Image-Edit | 4.7 | 10.0 | 17.0 | 2.4 | 8.9 |
BAGEL | 2.4 | 5.6 | 14.0 | 1.2 | 6.1 |
FLUX.1-Kontext-Dev | 2.3 | 5.5 | 13.0 | 1.2 | 5.8 |
Ovis-U1 | 1.2 | 3.3 | 4.0 | 2.4 | 2.8 |
Step1X-Edit | 0.0 | 2.2 | 2.0 | 3.5 | 1.9 |
OmniGen | 1.2 | 1.0 | 0.0 | 1.2 | 0.8 |
EMU2 | 1.2 | 1.1 | 0.0 | 0.0 | 0.5 |
HiDream-Edit | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
FLUX.1-Canny | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Model | 🧠 Instruction Reasoning | 🪞 Appearance Consistency | 👁️ Visual Plausibility |
---|---|---|---|
🏆 Gemini-2.5-Flash-Image | 61.2 | 86.0 | 91.3 |
🥈 GPT-Image-1 | 62.8 | 80.2 | 94.9 |
🥉 Gemini-2.0-Flash-exp | 48.9 | 68.2 | 82.7 |
BAGEL (w/ CoT) | 45.9 | 73.8 | 80.1 |
Gemini-2.0-Flash-pre | 49.9 | 68.4 | 84.9 |
Qwen-Image-Edit | 37.2 | 66.4 | 86.9 |
BAGEL | 36.5 | 53.5 | 73.0 |
FLUX.1-Kontext-Dev | 26.0 | 71.6 | 85.2 |
Ovis-U1 | 33.9 | 52.7 | 72.9 |
HiDream-Edit | 30.3 | 12.6 | 74.9 |
Step1X-Edit | 25.1 | 41.5 | 73.5 |
EMU2 | 22.6 | 38.2 | 78.3 |
OmniGen | 22.0 | 32.6 | 55.3 |
FLUX.1-Canny | 20.2 | 13.1 | 77.5 |
The input images for the four categories are located in the `data` directory. Each sample in the dataset contains an `instruction` and an associated `image`. You can use these inputs to generate the corresponding output image.
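The generation step can be sketched as a simple loop over the dataset. This is a minimal illustration, not the repo's own code: the field names (`category`, `index_name`, `instruction`, `image`) and the `generate_fn` callback are assumptions about the schema, so adapt them to the actual JSON layout.

```python
import json
import os

def save_outputs(data_json, model_name, generate_fn, out_root="outputs"):
    """Run a model over each benchmark sample and save outputs in the
    directory layout RISEBench expects.

    NOTE: the sample field names below are assumed, not the actual schema.
    `generate_fn(instruction, image)` stands in for your model's editing
    call and should return an object with a `.save(path)` method
    (e.g., a PIL Image).
    """
    with open(data_json) as f:
        samples = json.load(f)
    for sample in samples:
        category = sample["category"]        # e.g., "temporal_reasoning"
        index_name = sample["index_name"]    # e.g., "temporal_reasoning_1"
        out_dir = os.path.join(out_root, model_name, "images", category)
        os.makedirs(out_dir, exist_ok=True)
        edited = generate_fn(sample["instruction"], sample["image"])
        edited.save(os.path.join(out_dir, f"{index_name}.png"))
```

Plugging in your model's inference call for `generate_fn` yields outputs already organized for the evaluation script described below.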
Output File Structure: Generated outputs should be saved in the following directory structure:

`outputs/{MODEL_NAME}/images/{CATEGORY}/{INDEX_NAME}.{FORMAT}`

- `{MODEL_NAME}`: The name of the model you are using (e.g., `gpt-4o`).
- `{CATEGORY}`: The category of the sample (e.g., `temporal_reasoning`).
- `{INDEX_NAME}`: The index of the sample in the dataset.
- `{FORMAT}`: The file format of the output image (supported formats: `.png`, `.jpg`, or `.jpeg`).
For example: `outputs/gpt-4o-native/images/temporal_reasoning/temporal_reasoning_1.png`
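Before running the evaluation, it can help to verify that the output directory is complete. The sketch below counts saved images per category; the category directory names are assumptions based on the example above, so adjust them if the repo uses different ones.

```python
import glob
import os

# Assumed category directory names (only "temporal_reasoning" is confirmed
# by the example path above; the other three are guesses).
CATEGORIES = ["temporal_reasoning", "causal_reasoning",
              "spatial_reasoning", "logical_reasoning"]

def count_outputs(model_name, out_root="outputs"):
    """Count generated images per category, matching the supported
    output formats (.png, .jpg, .jpeg)."""
    counts = {}
    for cat in CATEGORIES:
        pattern = os.path.join(out_root, model_name, "images", cat, "*")
        files = [p for p in glob.glob(pattern)
                 if p.lower().endswith((".png", ".jpg", ".jpeg"))]
        counts[cat] = len(files)
    return counts
```

A category with a lower count than expected usually means some generations failed and should be rerun before evaluation.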
Once all outputs are generated and saved in the specified format, you can evaluate them using the `gpt_eval.py` script.

Open the `gpt_eval.py` file and update the following parameters with your OpenAI credentials:

- `api_key`: Your OpenAI API key.
- `api_base`: Your OpenAI API base URL (if applicable).
Execute the script using the following command:

`python gpt_eval.py --data data/data_total.json --output outputs/MODEL_NAME`
After running the script, three result files will be generated in the `outputs/{MODEL_NAME}` directory:

- `{MODEL_NAME}_judge.csv`: A CSV file containing the total evaluation scores.
- `{MODEL_NAME}_judge.xlsx`: An Excel file storing detailed responses from the GPT-4o judge model.
- `{MODEL_NAME}.pkl`: A serialized pickle file saving the raw responses from the judge model, which can be used to resume or extend evaluations later.
We exhibit some outputs of the five models in the appendix. For more details, please refer to our paper.
If you find RISEBench useful, please cite using this BibTeX:
@article{zhao2025envisioning,
title={Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing},
author={Zhao, Xiangyu and Zhang, Peiyuan and Tang, Kexian and Li, Hao and Zhang, Zicheng and Zhai, Guangtao and Yan, Junchi and Yang, Hua and Yang, Xue and Duan, Haodong},
journal={arXiv preprint arXiv:2504.02826},
year={2025}
}