We introduce VAP (Visual Adversarial Perturbation), a novel approach that strategically injects beneficial visual noise to mitigate object hallucinations in large vision-language models (LVMs), without modifying the base model. Our method consistently improves robustness across eight state-of-the-art LVMs under the rigorous POPE hallucination evaluation.
Examples of visual question answering (VQA) tasks before and after applying our proposed method to the original images. (a) and (b) demonstrate the suppression of hallucinations in vision-/text-axis evaluations. (c) and (d) show the reduction of hallucinations in open-ended tasks.
[2025-02-02] VAP is now open-source! Explore our Visual Adversarial Perturbation (VAP) method to mitigate object hallucinations in LVMs. Check out the repo and get started!
[2025-02-03] Our paper "Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs" is now available!
We evaluate eight state-of-the-art Large Vision-Language Models (LVMs) to validate the efficacy of our proposed approach. These models, spanning significant advancements in multimodal AI from September 2023 to December 2024, range from 7.1B to 16.1B parameters.
Model | Parameters | Language Model | Vision Model | Release Date |
---|---|---|---|---|
LLaVA-v1.5 | 7.1B | Vicuna-7B | CLIP ViT-L/14 | 2023-09 |
Instruct-BLIP | 7.9B | Vicuna-7B | ViT-G | 2023-09 |
Intern-VL2 | 8.1B | InternLM2.5-7B | InternViT-300M | 2024-07 |
Intern-VL2-MPO | 8.1B | InternLM2.5-7B | InternViT-300M | 2024-11 |
DeepSeek-VL2 | 16.1B | DeepSeekMoE-16B | SigLIP-400M | 2024-12 |
Qwen-VL2 | 8.3B | Qwen2-7B | ViT-Qwen | 2024-08 |
LLaVA-OV | 8.0B | Qwen2-7B | SigLIP-400M | 2024-08 |
Ovis1.6-Gemma2 | 9.4B | Gemma2-9B | SigLIP-400M | 2024-11 |
POPE Benchmark: Evaluation triplets are sampled from the MS-COCO dataset and formatted in the following JSON structure:
Example JSON format:
{
"id": 0,
"image": "name", // The image filename from the dataset
"question": "XXX", // The generated question related to the image
"gt": "yes/no" // The ground truth answer (binary: "yes" or "no")
}
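As a minimal illustration, POPE-style entries in the format above can be loaded and validated with a short script. The entry values below are placeholders, not actual benchmark data:

```python
# Hypothetical POPE-style entries matching the JSON structure above.
entries = [
    {"id": 0, "image": "name_0.jpg",
     "question": "Is there a dog in the image?", "gt": "yes"},
    {"id": 1, "image": "name_1.jpg",
     "question": "Is there a surfboard in the image?", "gt": "no"},
]

# Basic validation: every entry needs all four fields and a binary ground truth.
for e in entries:
    assert {"id", "image", "question", "gt"} <= e.keys()
    assert e["gt"] in ("yes", "no")

print(len(entries))  # number of evaluation triplets
```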
BEAF Benchmark: This dataset follows the format specified in the BEAF repository.
This section provides detailed instructions for setting up the environments required to run various LVMs. We ensure compatibility across different models by structuring the setup efficiently, leveraging shared dependencies where applicable.
- Model: liuhaotian/llava-v1.5-7b
- Environment Setup:
cd env_setting/LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e .
pip install ftfy regex tqdm
pip install protobuf transformers_stream_generator matplotlib
- Models:
- Shared Environment Setup:
conda create -n internvl python=3.9 -y
conda activate internvl
pip install lmdeploy==0.5.3
pip install timm ftfy regex tqdm matplotlib
pip install flash-attn==2.3.6 --no-build-isolation
- Model:
- Shared Environment Setup:
cd env_setting/LLaVA-NeXT
conda create -n llava_onevision python=3.10 -y
conda activate llava_onevision
pip install --upgrade pip
pip install -e ".[train]"
pip install ftfy regex tqdm matplotlib
pip install transformers==4.47.0
pip install flash-attn==2.5.2
- Model: Qwen/Qwen2-VL-7B-Instruct
- Environment Setup:
conda create -n qwen python=3.11 -y
conda activate qwen
pip install transformers
pip install qwen-vl-utils
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install accelerate==0.26.1
pip install ftfy regex tqdm matplotlib
- Model: deepseek-ai/deepseek-vl2
- Environment Setup:
conda create -n deepseek python=3.10 -y
conda activate deepseek
cd env_setting/DeepSeek-VL2/
pip install -e .
pip install ftfy regex tqdm matplotlib
pip install flash_attn==2.5.8
pip install --force-reinstall --no-deps --pre xformers
pip install transformers==4.38.2
These environments keep dependencies isolated per model family, minimizing version conflicts. For further speedups, consider CUDA optimizations, distributed inference, and efficient memory management.
- Visual Adversarial Perturbation (VAP) Execution
To generate visual noise for mitigating hallucinations in LVMs, run the following command:
bash script/VAP.sh
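Conceptually, VAP optimizes an additive perturbation on the input image within a small norm ball. The sketch below is a generic PGD-style loop under assumed names: the toy quadratic `surrogate_loss` merely stands in for the hallucination-aware objective, and this is not the repository's actual implementation (see `script/VAP.sh` for that):

```python
def surrogate_loss(image):
    # Toy objective standing in for the (assumed) hallucination-aware loss.
    target = [0.2, 0.8, 0.5]
    return sum((x - t) ** 2 for x, t in zip(image, target))

def grad(image, eps=1e-5):
    # Finite-difference gradient of the surrogate loss.
    g = []
    for i in range(len(image)):
        bumped = list(image)
        bumped[i] += eps
        g.append((surrogate_loss(bumped) - surrogate_loss(image)) / eps)
    return g

def vap_like_perturb(image, epsilon=0.05, alpha=0.01, steps=20):
    """Sign-gradient descent on the loss, keeping the perturbation
    within an L-infinity ball of radius epsilon around the clean image."""
    adv = list(image)
    for _ in range(steps):
        g = grad(adv)
        # Step against the gradient (we minimize the loss here).
        adv = [x - alpha * (1 if gi > 0 else -1 if gi < 0 else 0)
               for x, gi in zip(adv, g)]
        # Project back into the epsilon-ball and the valid pixel range [0, 1].
        adv = [min(max(a, x0 - epsilon, 0.0), x0 + epsilon, 1.0)
               for a, x0 in zip(adv, image)]
    return adv

clean = [0.4, 0.6, 0.5]  # toy "image" as a short pixel vector
adv = vap_like_perturb(clean)
print(max(abs(a - x) for a, x in zip(adv, clean)))  # bounded by epsilon
```

The key design point illustrated is the projection step: whatever objective drives the perturbation, the result stays visually close to the original image.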
- Hallucination Evaluation
To assess the model's performance on hallucination benchmarks, execute:
bash script/evaluate.sh
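For binary yes/no answers in the POPE format, the standard metrics (accuracy, precision, recall, and F1 with "yes" as the positive class) can be computed as follows. The prediction and ground-truth lists are placeholders for illustration:

```python
def pope_metrics(preds, gts):
    """Accuracy, precision, recall, and F1, treating "yes" as the
    positive class, as is standard for binary hallucination evaluation."""
    tp = sum(p == "yes" and g == "yes" for p, g in zip(preds, gts))
    fp = sum(p == "yes" and g == "no" for p, g in zip(preds, gts))
    fn = sum(p == "no" and g == "yes" for p, g in zip(preds, gts))
    tn = sum(p == "no" and g == "no" for p, g in zip(preds, gts))
    acc = (tp + tn) / len(gts)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

# Placeholder predictions and ground truths.
preds = ["yes", "no", "yes", "no"]
gts   = ["yes", "no", "no", "no"]
print(pope_metrics(preds, gts))
```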
If you find our work helpful, please consider citing our paper:
@article{zhang2025poison,
title={Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs},
author={Zhang, Kejia and Tao, Keda and Tang, Jiasheng and Wang, Huan},
journal={arXiv preprint arXiv:2501.19164},
year={2025}
}
Your citation helps support our research and further advances the field of reliable vision-language models.