Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding
Quick Take
The paper introduces Fox, a training-free, inference-time framework that addresses object hallucination in Large Vision-Language Models (LVLMs) by diagnosing structural misalignment in attention heads and severing the risky shortcuts they create. The authors report that Fox outperforms SID, the prior method they compare against, by 29.1% while preserving linguistic richness.
Key Points
- Fox identifies risky mediators using a visual attention entropy probe.
- The framework executes causal interventions via numerical logit saturation.
- Fox achieves state-of-the-art performance, surpassing SID by 29.1%.
- The approach maintains linguistic richness while enhancing model faithfulness.
- Code for Fox is publicly available for further research.
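To make the first key point concrete, here is a minimal sketch of what a per-head visual attention entropy probe could look like. It is not the authors' implementation: the function names, the array layout, and the rule of flagging the highest-entropy heads are all assumptions made for illustration, and the paper's actual criterion may differ.

```python
import numpy as np

def visual_attention_entropy(attn, visual_idx, eps=1e-12):
    """Entropy of each head's attention over image tokens (illustrative only).

    attn:       (num_heads, seq_len) attention weights of the current query token.
    visual_idx: positions of the image tokens in the sequence (assumed known).
    Returns (entropy, mass): per-head entropy of the attention renormalized
    over visual tokens, and the total attention mass placed on visual tokens.
    """
    vis = attn[:, visual_idx]
    mass = vis.sum(axis=1)
    p = vis / (mass[:, None] + eps)           # renormalize over visual tokens
    entropy = -(p * np.log(p + eps)).sum(axis=1)
    return entropy, mass

def flag_risky_heads(entropy, k):
    """Assumed rule: the k heads with the most diffuse visual attention."""
    order = np.argsort(entropy)[::-1][:k]
    return sorted(order.tolist())
```

A head that spreads its attention evenly over four image tokens has entropy log 4, while a head locked onto a single image token has entropy near zero, so the probe separates diffuse heads from focused ones without any labels.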
Paper Resources
Abstract: Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination. Deviating from the prevailing attention intensity assumption, we reveal a deeper dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence to lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework. Fox diagnoses structural misalignment using a visual attention entropy probe to localize risky mediators unsupervisedly. We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path. Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency. Extensive experiments demonstrate that Fox achieves SOTA performance, outperforming SID by 29.1% while preserving linguistic richness. Code is available at this https URL.
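The intervention and decoding steps named in the abstract can be illustrated with a rough sketch. Everything below is an assumption rather than the paper's method: the saturation constant, the choice to saturate attention scores on text positions, the total variation distance used as the conflict measure, and the threshold `tau` are all hypothetical stand-ins chosen to show the general shape of the idea.

```python
import numpy as np

NEG_SATURATION = -1e4  # assumed value; large enough that softmax underflows to 0

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def saturate_text_logits(attn_logits, risky_heads, text_idx):
    """Cut the language-prior path in the flagged heads (illustrative only).

    attn_logits: (num_heads, seq_len) pre-softmax attention scores.
    For each risky head, scores on text positions are pushed to
    NEG_SATURATION so the head can only attend to the remaining positions.
    """
    out = attn_logits.copy()
    out[np.ix_(risky_heads, text_idx)] = NEG_SATURATION
    return out

def conflict_gated_logits(obs_logits, int_logits, tau=0.1):
    """Pick between observational and interventional next-token logits.

    Assumed gate: if the total variation distance between the two
    distributions exceeds tau, trust the intervened pass; otherwise keep
    the unmodified pass to preserve fluency.
    """
    p_obs, p_int = softmax(obs_logits), softmax(int_logits)
    conflict = 0.5 * np.abs(p_obs - p_int).sum()
    chosen = int_logits if conflict > tau else obs_logits
    return chosen, conflict
```

Under these assumptions the intervention only changes the output at steps where the two passes disagree, which is one plausible reading of how faithfulness and fluency could be reconciled.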
| Comments: | 29 pages, 25 figures. Accepted by ICML 2026 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2606.27596 [cs.CV] (or arXiv:2606.27596v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.27596 (arXiv-issued DOI via DataCite) |
Submission history
From: Liu Yu
[v1] Thu, 25 Jun 2026 22:55:46 UTC (6,070 KB)
— Originally published at arxiv.org