ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models
To address object hallucination, several contrastive decoding strategies have been introduced in recent years. Among these, the Visual Contrastive Decoding (VCD) method has shown promise in reducing hallucinations by contrasting the output distributions of the original and perturbed visual inputs, thereby mitigating the model's excessive reliance on language priors.
However, these methods have two main limitations:

- While reducing over-reliance on language priors, they may compromise the coherence and accuracy of the generated content.
- Contrastive decoding requires processing the original and contrastive inputs separately, which considerably increases inference time.
To address these shortcomings, we propose a training-free method that effectively reduces hallucinations without compromising content quality or inference speed.
Our analysis indicates that modality fusion in MLLMs occurs primarily in the middle layers. Our Visual Amplification Fusion (VAF) method amplifies visual signals at exactly these layers, enabling the model to capture more distinctive visual features during fusion and, in turn, produce fewer false descriptions in the generated text.
This technique not only strengthens the model's visual representations but also retains the beneficial influence of language priors, thus preserving content quality. Furthermore, by eliminating the need to process contrastive samples, VAF maintains inference speed.
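The core operation can be sketched as a simple scaling of hidden states in the middle decoder layers: visual-token activations are amplified while the remaining activations are mildly suppressed. The layer range, the parameter names (`enh_para`, `sup_para`), and the plain NumPy formulation below are illustrative assumptions, not the repository's actual implementation:

```python
import numpy as np

def vaf_fuse(hidden, visual_mask, layer_idx, num_layers,
             enh_para=1.15, sup_para=0.95):
    """Amplify visual-token hidden states in the middle layers.

    hidden:      (seq_len, dim) activations entering a decoder layer
    visual_mask: (seq_len,) boolean, True at visual-token positions
    layer_idx:   index of the current decoder layer
    """
    # Only intervene in the middle third of the network, where
    # (per the analysis above) modality fusion takes place.
    lo, hi = num_layers // 3, 2 * num_layers // 3
    if not (lo <= layer_idx < hi):
        return hidden
    scale = np.where(visual_mask[:, None], enh_para, sup_para)
    return hidden * scale

# Toy usage: 6 tokens (first 4 visual), 8-dim states, layer 16 of 32.
h = np.ones((6, 8))
mask = np.array([True] * 4 + [False] * 2)
out = vaf_fuse(h, mask, layer_idx=16, num_layers=32)
```

Because the scaling is applied inside the forward pass of a single stream, no contrastive second pass is needed, which is why inference speed is preserved.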
- Setup
- Visual Neglect in Modal Fusion
- VAF Inference & Evaluation
```shell
conda create -n clearsight python=3.10
conda activate clearsight
cd LLaVA
pip install -e .
```
We provide the following script to reproduce our analysis of the over-reliance of multimodal large language models on linguistic priors.
```shell
bash ./visaug/analysis/vis_flow.sh
```

or

```shell
python ./visaug/analysis/vis_flow.py \
    --model-path /model/llava \
    --question-file ./data/pope/coco/coco_pope_random.json \
    --image-folder ./data/pope/coco/val2014 \
    --answers-file ./outputs/analysis/res_coco_random.pt
```
The analysis results are shown in the two figures below, from which we draw two key conclusions:

- The model performs the crucial fusion of visual and textual modalities in the middle layers, creating cross-modal semantic representations that drive the final predictions.
- During this critical fusion process, the model pays inadequate attention to the visual modality.
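One way to quantify this visual neglect is to measure, per layer, the share of the answer token's attention mass that lands on visual tokens. The snippet below is a generic sketch of that measurement, not necessarily how `vis_flow.py` computes it:

```python
import numpy as np

def visual_attention_share(attn, visual_mask):
    """Fraction of attention mass on visual tokens, per layer.

    attn:        (num_layers, num_heads, seq_len) attention weights from
                 the final (answer) token to every position
    visual_mask: (seq_len,) boolean, True at visual-token positions
    """
    per_head = attn[..., visual_mask].sum(-1)   # (layers, heads)
    return per_head.mean(-1)                    # average over heads

# Toy example: 4 layers, 2 heads, 10 tokens (first 6 visual).
rng = np.random.default_rng(0)
attn = rng.random((4, 2, 10))
attn /= attn.sum(-1, keepdims=True)             # normalize to distributions
mask = np.array([True] * 6 + [False] * 4)
share = visual_attention_share(attn, mask)      # one value per layer
```

A low `share` in the middle layers, relative to the fraction of visual tokens in the sequence, would indicate the under-attention described above.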
We provide the following scripts to reproduce the experimental results of the Visual Amplification Fusion (VAF) method.
```shell
bash ./visaug/inference/infer_pope.sh
bash ./visaug/inference/eval_pope.sh
```

or

```shell
python ./visaug/inference/infer_pope.py \
    --model-path /model/llava \
    --question-file ./data/pope/coco/coco_pope_random.json \
    --image-folder ./data/pope/coco/val2014 \
    --answers-file ./outputs/inference/res_coco_random.jsonl \
    --use-visaug \
    --enh-para 1.15 \
    --sup-para 0.95

python ./visaug/inference/eval_pope.py \
    --annotation-file ./data/pope/coco/coco_pope_random.json \
    --result-file ./outputs/inference/res_coco_random.jsonl
```
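POPE is a binary yes/no probing benchmark, so accuracy is simply the fraction of answers whose polarity matches the label. A minimal sketch of the metric follows; the field names `label` and `text` are assumptions about the jsonl layout, not the evaluation script's actual schema:

```python
def pope_accuracy(records):
    """records: iterable of dicts with 'label' (gold 'yes'/'no') and
    'text' (the model's free-form answer)."""
    correct = 0
    total = 0
    for r in records:
        # Reduce the free-form answer to a yes/no prediction.
        pred = "yes" if "yes" in r["text"].lower() else "no"
        gold = r["label"].strip().lower()
        correct += pred == gold
        total += 1
    return correct / total

# Toy check on three hand-written records.
recs = [
    {"label": "yes", "text": "Yes, there is a dog."},
    {"label": "no",  "text": "No."},
    {"label": "no",  "text": "Yes, I can see one."},  # hallucinated object
]
acc = pope_accuracy(recs)
```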
Hallucination Mitigation Results (POPE-Random Accuracy)
| Method | LLaVA-7B | LLaVA-13B | Qwen-7B |
|---|---|---|---|
| Regular | 87.8 | 87.6 | 88.2 |
| VCD | 88.4 | 88.9 | 89.1 |
| VAF | 89.7 | 90.1 | 90.0 |
Coherence of Generated Content (Nocaps-CIDEr)
| Method | LLaVA-7B | LLaVA-13B |
|---|---|---|
| Regular | 78.7 | 82.6 |
| VCD | 65.7 | 68.9 |
| VAF | 78.8 | 82.3 |
Inference Speed (Nocaps)