Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding

arXiv cs.CV·Liu Yu, Can Chen, Ping Kuang, Zhikun Feng, Fan Zhou, Gillian Dobbie


·~2 min·6/29/2026·en·0

Quick Answer

The paper introduces Fox, a novel inference-time framework that addresses hallucination in Large Vision-Language Models (LVLMs) by diagnosing structural misalignment and severing risky shortcuts.

Quick Take

The paper introduces Fox, a training-free, inference-time framework that addresses object hallucination in Large Vision-Language Models (LVLMs) by diagnosing structural misalignment in attention heads and severing the risky shortcuts that bypass visual grounding. The abstract reports that Fox outperforms SID by 29.1% while preserving linguistic richness.

Key Points

Fox identifies risky mediators using a visual attention entropy probe.
The framework executes causal interventions via numerical logit saturation.
Fox achieves state-of-the-art performance, surpassing SID by 29.1%.
The approach maintains linguistic richness while enhancing model faithfulness.
Code for Fox is publicly available for further research.
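
The first key point can be illustrated with a toy probe. This is a minimal sketch, not the authors' implementation: the abstract does not define the probe, so the normalized Shannon entropy, the token layout in `visual_idx`, the flagging direction (diffuse visual attention counts as risky), and the `threshold` value are all assumptions made here for illustration.

```python
import math

def visual_attention_entropy(attn_row, visual_idx):
    """Normalized Shannon entropy (0..1) of one head's attention over visual tokens.

    attn_row: attention weights of the current query over all key positions.
    visual_idx: positions of image tokens (hypothetical layout, not from the paper).
    """
    mass = [attn_row[i] for i in visual_idx]
    total = sum(mass)
    if total <= 0.0 or len(mass) < 2:
        return 0.0
    # Renormalize so the visual-token weights form a distribution.
    probs = [m / total for m in mass]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Divide by the maximum possible entropy so heads are comparable.
    return entropy / math.log(len(probs))

def flag_risky_heads(attn_rows, visual_idx, threshold=0.9):
    """Assumed rule: a head is flagged when its visual attention is near-uniform."""
    return [
        head
        for head, row in enumerate(attn_rows)
        if visual_attention_entropy(row, visual_idx) > threshold
    ]

# Toy example: 4 image tokens (positions 0-3) followed by 2 text tokens.
rows = [
    [0.85, 0.05, 0.03, 0.02, 0.03, 0.02],  # focused on one image token
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],  # diffuse over the image, heavy on text
]
risky = flag_risky_heads(rows, visual_idx=[0, 1, 2, 3])
```

In this toy input only the second head is flagged. Because the probe needs no labels, it matches the unsupervised framing in the abstract, but the paper may use a different statistic or threshold.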


[Submitted on 25 Jun 2026]


Abstract: Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination. Deviating from the prevailing attention intensity assumption, we reveal a deeper dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence to lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework. Fox diagnoses structural misalignment using a visual attention entropy probe to localize risky mediators unsupervisedly. We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path. Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency. Extensive experiments demonstrate that Fox achieves SOTA performance, outperforming SID by 29.1% while preserving linguistic richness. Code is available at this https URL.
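
The intervention and decoding steps in the abstract can be sketched on toy numbers. This is one plausible reading under stated assumptions, not the paper's algorithm: "logit saturation" is taken to mean forcing selected attention logits to a very large negative constant (`SATURATION`) so their softmax weight becomes zero, and "conflict-gated" is taken to mean switching to the intervened next-token distribution only when its Jensen-Shannon divergence from the unmodified one exceeds a hypothetical `gate` threshold. All function names and constants are illustrative.

```python
import math

SATURATION = -1e9  # assumed saturation constant; not a value from the paper

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def saturate_shortcut(attn_logits, shortcut_idx):
    """Attention weights after pushing the chosen key positions to SATURATION."""
    blocked = set(shortcut_idx)
    patched = [SATURATION if i in blocked else x for i, x in enumerate(attn_logits)]
    return softmax(patched)

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)
    mid = [(x + y) / 2.0 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def gated_next_token_probs(obs_logits, int_logits, gate=0.05):
    """Return (distribution, conflict score).

    obs_logits: next-token logits from the unmodified forward pass.
    int_logits: next-token logits from the pass with saturated heads.
    The intervened distribution is used only when the two passes disagree.
    """
    p_obs, p_int = softmax(obs_logits), softmax(int_logits)
    conflict = js_divergence(p_obs, p_int)
    return (p_int if conflict > gate else p_obs), conflict

# Blocking key position 3 removes all of its attention weight.
weights = saturate_shortcut([2.0, 1.0, 0.5, 3.0], shortcut_idx=[3])

# The two passes disagree on the top token, so the intervened one is chosen.
probs, conflict = gated_next_token_probs([3.0, 1.0, 0.0], [1.0, 3.0, 0.0])
```

Gating on disagreement leaves most decoding steps untouched, which is one way the trade-off between faithfulness and fluency described in the abstract could be realized. In a real LVLM the saturation would be applied inside the attention layers of the flagged heads rather than to a standalone list.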

Comments:	29 pages, 25 figures. Accepted by ICML 2026
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.27596 [cs.CV]
	(or arXiv:2606.27596v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2606.27596 arXiv-issued DOI via DataCite

Submission history

From: Liu Yu
[v1] Thu, 25 Jun 2026 22:55:46 UTC (6,070 KB)

Originally published at arxiv.org
