Official Repository of paper: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

PhoenixZ810/RISEBench

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Xiangyu Zhao*, Peiyuan Zhang*, Kexian Tang*, Xiaorong Zhu*,

Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia,

Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan

If you find our work helpful, please consider giving us a ⭐ or citation 😊

🎉 News

  • [2025/08/29] We have updated the results of Gemini-2.5-Flash-Image. The model now takes the top spot, surpassing GPT-Image-1.
  • [2025/08/20] We have updated the results of Qwen-Image-Edit.
  • [2025/08/07] We have updated the results of FLUX.1-Kontext-dev, thanks to @ErfeiCui.
  • [2025/07/08] We’ve launched a HuggingFace Space that hosts every image generated during our model evaluations. Dive into the gallery and explore the visual diversity of RISEBench, just click and enjoy! Visit the gallery →
  • [2025/06/15] RISEBench has been officially evaluated by BAGEL, which achieves the third-highest overall performance (Thinking Mode), with results comparable to Gemini-2.0. Check OfficialRepo for details about the evaluation.
  • [2025/05/27] We have released two versions of our benchmark suite: the full version, named RISEBench-360, and a smaller version, named RISEBench-64. The RISEBench-64 version is also available in our repository as an initial offering. Feel free to choose the version that best suits your needs! 😃
  • [2025/05/27] Our paper has been updated! Please refer to arXiv for comprehensive details.
  • [2025/05/19] The final version of RISEBench (scaled up to 360 samples) has been released! Please refer to HF-RISEBench for the full RISEBench data.
  • [2025/04/08] RISEBench is Scaling Up! The final complete benchmark will be released soon. Stay tuned for updates!
  • [2025/04/08] The benchmark and evaluation code have been released! Have fun 😃.
  • [2025/04/05] Our paper is released.
  • [2025/04/05] The benchmark and evaluation code will be released soon.

📖 Introduction

In this work, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning.

To comprehensively assess model performance across diverse task types, we define three key evaluation dimensions: Instruction Reasoning, Appearance Consistency, and Visual Plausibility.

In addition, we design a robust LMM-as-a-Judge evaluation pipeline and use a state-of-the-art LMM (GPT-4o) to generate automated assessments. This approach offers a scalable and reproducible alternative to human evaluation while maintaining a high degree of alignment with human judgment.

As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems.

🔥 Benchmark Performance

To evaluate representative visual editing approaches, we selected a diverse set of models spanning multiple architectures and generation paradigms. Specifically, FLUX.1-Canny serves as a representative diffusion-based editing model, while EMU2 exemplifies the auto-regressive generation paradigm. We also include 8 proprietary models, such as GPT-4o-Image, Gemini 2.0-Flash-Experimental, and Gemini 2.0-Flash-Preview. Outputs of proprietary models are obtained through their official APIs.

📊 Overall performance on RISEBench

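| Model | Temporal (%) | Causal (%) | Spatial (%) | Logical (%) | Overall (%) |
|---|---|---|---|---|---|
| 🏆 Gemini-2.5-Flash-Image | 25.9 | 47.8 | 37.0 | 18.8 | 32.8 |
| 🥈 GPT-Image-1 | 34.1 | 32.2 | 37.0 | 10.6 | 28.9 |
| 🥉 Gemini-2.0-Flash-exp | 8.2 | 15.5 | 23.0 | 4.7 | 13.3 |
| BAGEL (w/ CoT) | 5.9 | 17.8 | 21.0 | 1.2 | 11.9 |
| Gemini-2.0-Flash-pre | 10.6 | 13.3 | 11.0 | 2.3 | 9.4 |
| Qwen-Image-Edit | 4.7 | 10.0 | 17.0 | 2.4 | 8.9 |
| BAGEL | 2.4 | 5.6 | 14.0 | 1.2 | 6.1 |
| FLUX.1-Kontext-Dev | 2.3 | 5.5 | 13.0 | 1.2 | 5.8 |
| Ovis-U1 | 1.2 | 3.3 | 4.0 | 2.4 | 2.8 |
| Step1X-Edit | 0.0 | 2.2 | 2.0 | 3.5 | 1.9 |
| OmniGen | 1.2 | 1.0 | 0.0 | 1.2 | 0.8 |
| EMU2 | 1.2 | 1.1 | 0.0 | 0.0 | 0.5 |
| HiDream-Edit | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| FLUX.1-Canny | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |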
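
One way to read the Overall column is as a sample-weighted mean of the four per-category accuracies. The sketch below illustrates that reading. The per-category sample counts (85, 90, 100, 85) are an assumption: they are not stated in this README, and were chosen only because they sum to 360 and reproduce the reported overall scores. The function name `overall_accuracy` is likewise hypothetical and not part of the repository.

```python
# ASSUMPTION: per-category sample counts are not given in this README.
# These values sum to 360 and are used for illustration only.
CATEGORY_COUNTS = {"Temporal": 85, "Causal": 90, "Spatial": 100, "Logical": 85}


def overall_accuracy(per_category, counts=CATEGORY_COUNTS):
    """Sample-weighted mean of per-category accuracies (in percent)."""
    total = sum(counts.values())
    return sum(per_category[name] * counts[name] for name in counts) / total


# Per-category scores taken from the table above.
gemini_25 = {"Temporal": 25.9, "Causal": 47.8, "Spatial": 37.0, "Logical": 18.8}
gpt_image_1 = {"Temporal": 34.1, "Causal": 32.2, "Spatial": 37.0, "Logical": 10.6}

print(round(overall_accuracy(gemini_25), 1))    # 32.8 under the assumed counts
print(round(overall_accuracy(gpt_image_1), 1))  # 28.9 under the assumed counts
```

Because the categories differ in size under this assumption, the overall score is not the plain average of the four columns.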

🎨 Comparison across models on three evaluation sub-dimensions

| Model | 🧠 Instruction Reasoning | 🪞 Appearance Consistency | 👁️ Visual Plausibility |
|---|---|---|---|
| 🏆 Gemini-2.5-Flash-Image | 61.2 | 86.0 | 91.3 |
| 🥈 GPT-Image-1 | 62.8 | 80.2 | 94.9 |
| 🥉 Gemini-2.0-Flash-exp | 48.9 | 68.2 | 82.7 |
| BAGEL (w/ CoT) | 45.9 | 73.8 | 80.1 |
| Gemini-2.0-Flash-pre | 49.9 | 68.4 | 84.9 |
| Qwen-Image-Edit | 37.2 | 66.4 | 86.9 |
| BAGEL | 36.5 | 53.5 | 73.0 |
| FLUX.1-Kontext-Dev | 26.0 | 71.6 | 85.2 |
| Ovis-U1 | 33.9 | 52.7 | 72.9 |
| HiDream-Edit | 30.3 | 12.6 | 74.9 |
| Step1X-Edit | 25.1 | 41.5 | 73.5 |
| EMU2 | 22.6 | 38.2 | 78.3 |
| OmniGen | 22.0 | 32.6 | 55.3 |
| FLUX.1-Canny | 20.2 | 13.1 | 77.5 |

🛠️ Quick Start

1. Output Generation

The input images for the four categories are located in the data directory. Each sample in the dataset contains an instruction and an associated image. You can use these inputs to generate the corresponding output image.

Output File Structure: Generated outputs should be saved in the following directory structure:

outputs/{MODEL_NAME}/images/{CATEGORY}/{INDEX_NAME}.{FORMAT}

  • {MODEL_NAME}: The name of the model you are using (e.g., gpt-4o).
  • {CATEGORY}: The category of the sample (e.g., temporal_reasoning).
  • {INDEX_NAME}: The index of the sample in the dataset.
  • {FORMAT}: The file extension of the output image, without the leading dot (supported formats: png, jpg, or jpeg).

For example: outputs/gpt-4o-native/images/temporal_reasoning/temporal_reasoning_1.png
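
The directory layout above can be assembled programmatically. The following is a minimal sketch, not code from the repository: the helper name `output_path` and its parameters are hypothetical, and it only builds the path string described in this section.

```python
from pathlib import Path

# Extensions listed as supported in this section.
SUPPORTED_FORMATS = {"png", "jpg", "jpeg"}


def output_path(model_name, category, index_name, fmt="png", root="outputs"):
    """Build outputs/{MODEL_NAME}/images/{CATEGORY}/{INDEX_NAME}.{FORMAT}."""
    fmt = fmt.lower().lstrip(".")  # accept "PNG" or ".png" as well
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported image format: {fmt!r}")
    return Path(root) / model_name / "images" / category / f"{index_name}.{fmt}"


path = output_path("gpt-4o-native", "temporal_reasoning", "temporal_reasoning_1")
print(path.as_posix())
# outputs/gpt-4o-native/images/temporal_reasoning/temporal_reasoning_1.png
```

Before saving an image, calling `path.parent.mkdir(parents=True, exist_ok=True)` creates the category directory if it does not exist yet.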

2. Evaluation By GPT-4.1

Once all outputs are generated and saved in the specified format, you can evaluate them using the gpt_eval.py script.

Step 1: Configure API Settings

Open the gpt_eval.py file and update the following parameters with your OpenAI credentials:

  • api_key: Your OpenAI API key.
  • api_base: Your OpenAI API base URL (if applicable).

Step 2: Run the Evaluation Script

Execute the script using the following command:

python gpt_eval.py --data data/data_total.json --output outputs/MODEL_NAME
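
Before spending API calls on evaluation, it can help to confirm that the output directory actually contains images in the expected layout. The sketch below is an optional pre-check and is not part of the repository. The function name `count_outputs` is hypothetical, and it relies only on the directory structure described in the Output Generation step, not on the schema of data/data_total.json.

```python
from collections import Counter
from pathlib import Path

# Extensions listed as supported in the Output Generation step.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_outputs(model_dir):
    """Count image files per category under {model_dir}/images/."""
    images_dir = Path(model_dir) / "images"
    counts = Counter()
    if not images_dir.is_dir():
        return counts  # nothing generated yet, or wrong path
    for category_dir in sorted(images_dir.iterdir()):
        if category_dir.is_dir():
            counts[category_dir.name] = sum(
                1
                for f in category_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


if __name__ == "__main__":
    # Replace MODEL_NAME with the directory name used for your model.
    for category, n in count_outputs("outputs/MODEL_NAME").items():
        print(f"{category}: {n} image(s)")
```

Comparing the printed counts with the number of samples per category in the benchmark data is a quick way to spot missing outputs.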

Step 3: Review the Results

After running the script, three result files will be generated in the outputs/{MODEL_NAME} directory:

  1. {MODEL_NAME}_judge.csv: A CSV file containing the total evaluation scores.
  2. {MODEL_NAME}_judge.xlsx: An Excel file storing detailed responses from the judge model.
  3. {MODEL_NAME}.pkl: A serialized pickle file saving the raw responses from the judge model, which can be used to resume or extend evaluations later.
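
The result files listed above can be inspected with the Python standard library. This is a hedged sketch rather than repository code: the function name `load_results` is hypothetical, and the column names in the CSV and the structure of the pickled object are not documented in this README, so the example reads them generically instead of assuming particular fields. If the pickle stores library objects (for example pandas data), that library must be installed for loading to succeed.

```python
import csv
import pickle
from pathlib import Path


def load_results(model_dir):
    """Load {MODEL_NAME}_judge.csv and {MODEL_NAME}.pkl from outputs/{MODEL_NAME}."""
    model_dir = Path(model_dir)
    name = model_dir.name  # the directory name doubles as MODEL_NAME
    with open(model_dir / f"{name}_judge.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # one dict per CSV row, keyed by header
    # Only unpickle files you produced yourself; pickle can execute code on load.
    with open(model_dir / f"{name}.pkl", "rb") as f:
        raw = pickle.load(f)
    return rows, raw


if __name__ == "__main__":
    model_dir = Path("outputs/MODEL_NAME")  # replace with your model directory
    if model_dir.is_dir():
        rows, raw = load_results(model_dir)
        print(f"{len(rows)} score row(s); raw responses stored as {type(raw).__name__}")
```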

🔥 Outputs of Current Models

We exhibit some outputs of the five models in the appendix. For more details, please refer to our paper.

Citation

If you find RISEBench useful, please cite using this BibTeX:

@article{zhao2025envisioning,
  title={Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing},
  author={Zhao, Xiangyu and Zhang, Peiyuan and Tang, Kexian and Li, Hao and Zhang, Zicheng and Zhai, Guangtao and Yan, Junchi and Yang, Hua and Yang, Xue and Duan, Haodong},
  journal={arXiv preprint arXiv:2504.02826},
  year={2025}
}
