Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs

What doesn't kill me makes me stronger!

Kejia Zhang, Keda Tao, Jiasheng Tang, Huan Wang

Westlake University

📖 Paper Teaser

VAP Teaser

We introduce VAP (Visual Adversarial Perturbation), a novel approach that strategically injects beneficial visual noise to mitigate object hallucinations in LVMs without modifying the underlying base model. Our method consistently enhances robustness across 8 state-of-the-art LVMs under the rigorous POPE hallucination evaluation.
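
As a rough intuition for how such a perturbation could be produced, the sketch below runs a PGD-style loop that nudges the image within a small L-infinity budget to lower a hallucination-related loss. The `hallucination_loss` hook, step size, and budget here are illustrative placeholders, not the exact objective or hyperparameters from the paper.

import torch

def generate_visual_perturbation(model, image, prompt,
                                 epsilon=8 / 255, step_size=1 / 255, steps=10):
    """Illustrative PGD-style loop: add a small, bounded perturbation to the
    image that lowers a hallucination-related loss. `model.hallucination_loss`
    is a hypothetical hook standing in for the paper's actual objective."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = model.hallucination_loss(image + delta, prompt)  # hypothetical hook
        loss.backward()
        with torch.no_grad():
            # Step against the gradient to reduce the loss, then project the
            # perturbation back into the L-infinity ball of radius epsilon.
            delta -= step_size * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()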

🎬 Demo: VAP in Action

VAP Demo

Examples of visual question answering (VQA) tasks before and after applying our proposed method to the original images. (a) and (b) demonstrate the suppression of hallucinations in vision-/text-axis evaluations; (c) and (d) show the reduction of hallucinations in open-ended tasks.


🚀 News

📒 [2025-02-02] VAP is now open-source! Explore our Visual Adversarial Perturbation (VAP) method to mitigate object hallucinations in LVMs. Check out the repo and get started! 🔥

📒 [2025-02-03] Our paper "Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs" is now available! 🎉


Code Implementation Overview

1. Baseline LVMs Details

We evaluate eight state-of-the-art Large Vision-Language Models (LVMs) to validate the efficacy of our proposed approach. These models, spanning significant advancements in multimodal AI from September 2023 to December 2024, range from 7.1B to 16.1B parameters.

| Model | Parameters | Language Model | Vision Model | Release Date |
|---|---|---|---|---|
| LLaVA-v1.5 | 7.1B | Vicuna-7B | CLIP ViT-L/14 | 2023-09 |
| Instruct-BLIP | 7.9B | Vicuna-7B | ViT-G | 2023-09 |
| Intern-VL2 | 8.1B | InternLM2.5-7B | InternViT-300M | 2024-07 |
| Intern-VL2-MPO | 8.1B | InternLM2.5-7B | InternViT-300M | 2024-11 |
| DeepSeek-VL2 | 16.1B | DeepSeekMoE-16B | SigLIP-400M | 2024-12 |
| Qwen-VL2 | 8.3B | Qwen2-7B | ViT-Qwen | 2024-08 |
| LLaVA-OV | 8.0B | Qwen2-7B | SigLIP-400M | 2024-08 |
| Ovis1.6-Gemma2 | 9.4B | Gemma2-9B | SigLIP-400M | 2024-11 |

2. Hallucination Evaluation Dataset

POPE Benchmark: Evaluation triplets are sampled from the MS-COCO dataset and are formatted in the following JSON structure:

Example JSON format:

{
  "id": 0,
  "image": "name", // The image filename from the dataset
  "question": "XXX", // The generated question related to the image
  "gt": "yes/no" // The ground truth answer (binary: "yes" or "no")
}
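
A minimal loader for entries in this format might look like the sketch below; the file name is a placeholder, and the code accepts either a JSON list or JSON-lines input, depending on how the triplets were exported.

import json

def load_pope_triplets(path):
    """Load POPE-style evaluation triplets from a JSON list or JSON-lines file."""
    with open(path, "r") as f:
        text = f.read().strip()
    if text.startswith("["):
        entries = json.loads(text)
    else:
        entries = [json.loads(line) for line in text.splitlines() if line.strip()]
    for entry in entries:
        assert entry["gt"] in ("yes", "no"), f"unexpected ground truth: {entry['gt']}"
    return entries

# Example usage (file name is a placeholder):
# triplets = load_pope_triplets("pope_coco.json")
# print(triplets[0]["image"], triplets[0]["question"], triplets[0]["gt"])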

BEAF Benchmark: This dataset follows the format specified in the BEAF repository.


3. Environment Setup

This section provides detailed instructions for setting up the environments required to run each LVM. Models with compatible dependencies share a single environment to avoid duplicated setup.

LLaVA-v1.5

cd env_setting/LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e .
pip install ftfy regex tqdm
pip install protobuf transformers_stream_generator matplotlib

Instruct-BLIP & InternVL Series

conda create -n internvl python=3.9 -y
conda activate internvl
pip install lmdeploy==0.5.3
pip install timm ftfy regex tqdm matplotlib
pip install flash-attn==2.3.6 --no-build-isolation

LLaVA-OneVision & Ovis1.6-Gemma2

cd env_setting/LLaVA-NeXT
conda create -n llava_onevision python=3.10 -y
conda activate llava_onevision
pip install --upgrade pip
pip install -e ".[train]"
pip install ftfy regex tqdm matplotlib
pip install transformers==4.47.0
pip install flash-attn==2.5.2

Qwen2-VL Series

conda create -n qwen python=3.11 -y
conda activate qwen
pip install transformers
pip install qwen-vl-utils
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install accelerate==0.26.1
pip install ftfy regex tqdm matplotlib

DeepSeek-VL2

conda create -n deepseek python=3.10 -y
conda activate deepseek
cd env_setting/DeepSeek-VL2/
pip install -e .
pip install ftfy regex tqdm matplotlib
pip install flash_attn==2.5.8
pip install --force-reinstall --no-deps --pre xformers
pip install transformers==4.38.2

These per-model environments keep dependencies isolated and minimize version conflicts. For faster inference, consider CUDA-optimized kernels (e.g., flash-attention), distributed inference, and memory-efficient loading.
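
After activating any of the environments above, a quick sanity check like the following (assuming `torch` and `transformers` were installed per the steps for that model) confirms the package versions and that the GPU is visible:

# Quick environment sanity check (run inside the activated conda env).
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))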

4. Running the Code

  1. Visual Adversarial Perturbation (VAP) Execution

To generate visual noise for mitigating hallucinations in LVMs, run the following command:

bash script/VAP.sh

  2. Hallucination Evaluation

To assess the model’s performance on hallucination benchmarks, execute:

bash script/evaluate.sh
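
For reference, POPE-style results are typically reported as accuracy, precision, recall, and F1 over binary yes/no answers. The sketch below shows how those metrics could be computed from a list of (prediction, ground-truth) pairs; it is an illustration, not the repository's evaluation script.

def pope_metrics(pairs):
    """Compute accuracy/precision/recall/F1 for binary yes/no answers,
    treating "yes" as the positive class. `pairs` is an iterable of
    (predicted, ground_truth) strings; in this sketch, any prediction
    that does not start with "yes" is counted as "no"."""
    tp = fp = tn = fn = 0
    for pred, gt in pairs:
        pred_yes = pred.strip().lower().startswith("yes")
        gt_yes = gt.strip().lower() == "yes"
        if pred_yes and gt_yes:
            tp += 1
        elif pred_yes and not gt_yes:
            fp += 1
        elif not pred_yes and not gt_yes:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example:
# print(pope_metrics([("Yes", "yes"), ("no", "yes"), ("No", "no")]))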

📌 Citation

If you find our work helpful, please consider citing our paper:

@article{zhang2025poison,
  title={Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs},
  author={Zhang, Kejia and Tao, Keda and Tang, Jiasheng and Wang, Huan},
  journal={arXiv preprint arXiv:2501.19164},
  year={2025}
}

Your citation helps support our research and further advances the field of reliable vision-language models. 🚀
