VISTA: Visual Information Steering with Token-logit Augmentation

arXiv: https://arxiv.org/abs/2502.03628

This is the official implementation of the paper "The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering".

Overview

VISTA is a training-free inference-time intervention framework that reduces hallucination in Large Vision-Language Models (LVLMs) while promoting genuine information. Our approach reveals and addresses three key patterns in how LVLMs process information:

  1. Gradual Visual Information Loss: Visually grounded tokens gradually become less favored throughout generation
  2. Early Excitation: Semantically meaningful tokens achieve peak activation in layers earlier than the final layer
  3. Hidden Genuine Information: Visually grounded tokens maintain relatively high rankings at inference

VISTA combines two complementary approaches:

  • Visual Steering Vector (VSV): Reinforces visual information in activation space
  • Self-Logits Augmentation (SLA): Leverages early layer activations to promote semantically meaningful decoding

Key Features

  • Training-free inference-time intervention
  • No external supervision required
  • Compatible with various decoding strategies
  • Applicable across multiple LVLM architectures
  • Reduces hallucination by ~40% on evaluated open-ended tasks

Installation

# Clone the repository
git clone https://github.com/LzVv123456/VISTA
cd VISTA

# Create the conda environment from environment.yml
conda env create -f environment.yml

# Activate it (the environment name is defined in environment.yml; "vista" is assumed here)
conda activate vista

Prepare Data

Download the MSCOCO 2014 dataset from the official website and extract it to your data directory.
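
A minimal sketch of one way to do this, assuming the standard COCO download URL and a local data/coco directory (the target path is an assumption; point the evaluation scripts at wherever you extract the images):

# Download and extract the MSCOCO 2014 validation images (official COCO URL)
# "data/coco" is an assumed target directory; adjust to your setup
mkdir -p data/coco
wget http://images.cocodataset.org/zips/val2014.zip -P data/coco
unzip data/coco/val2014.zip -d data/coco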

Usage

# For CHAIR evaluation.
bash run_chair.sh

# For POPE evaluation (specify split with --pope-type).
bash run_pope.sh

# For MMHal-Bench evaluation.
bash run_mmhal.sh

Please check the corresponding bash script for details on how to read the results.
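
For POPE, the evaluation split is selected with "--pope-type". A hypothetical invocation sketch (whether the flag is passed on the command line or set inside run_pope.sh depends on the script; "random", "popular", and "adversarial" are the standard POPE splits):

# Run POPE on the adversarial split (the flag may instead need to be edited inside run_pope.sh)
bash run_pope.sh --pope-type adversarial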

Configuration Options

  1. Model Selection: Use "--model" to specify the target LVLM (supported: "llava-1.5", "instructblip", "shikra", "minigpt-4")

  2. Visual Steering Vector (VSV): Enable with "--vsv" and control the strength via "--vsv-lambda"

  3. Self-Logits Augmentation (SLA): Enable with "--logits-aug", configure the target layers with "--logits-layers" and the mixing ratio with "--logits-alpha" (a combined example follows this list)
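
Putting these options together, a hypothetical command sketch (the entry-point script name, model, and values below are illustrative only; the actual invocations live in run_chair.sh, run_pope.sh, and run_mmhal.sh):

# Illustrative flag combination; see the provided bash scripts for the real entry point and defaults
python eval.py \
    --model llava-1.5 \
    --vsv --vsv-lambda 0.05 \
    --logits-aug --logits-layers 20 --logits-alpha 0.5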

Best Practices

  1. VSV is designed to counteract Gradual Visual Information Loss and is suited to open-ended generation tasks. Different LVLMs favor different lambda scales, so calibrate the scale when applying VSV to a new architecture (a simple sweep is sketched after this list). The --vsv-lambda parameter offers a flexible way to shift the model from more aggressive (more hallucination) to more conservative behavior.
  2. The impact of SLA depends on both the target layers and the strength of --logits-alpha. A rule of thumb is to use a smaller alpha for larger window sizes, and vice versa (see Table 4 in the paper).
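
A hypothetical calibration sweep over VSV strengths (the entry-point script, model, and value range are illustrative; adapt this to the provided bash scripts and compare the resulting CHAIR/POPE scores):

# Sweep several VSV strengths on a new architecture and compare hallucination metrics
for lam in 0.01 0.05 0.1 0.2; do
    python eval.py --model llava-1.5 --vsv --vsv-lambda "$lam"
done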

Citation

@misc{li2025hiddenlifetokensreducing,
      title={The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering}, 
      author={Zhuowei Li and Haizhou Shi and Yunhe Gao and Di Liu and Zhenting Wang and Yuxiao Chen and Ting Liu and Long Zhao and Hao Wang and Dimitris N. Metaxas},
      year={2025},
      eprint={2502.03628},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.03628}, 
}

Acknowledgement

This project builds upon the following excellent works:
