Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs
Quick Answer
The ViPSy framework reduces hallucinations in Vision-Language Models (VLMs) by constructing preference pairs that are both policy-aligned and visually grounded. Compared with the previous state-of-the-art method, it lowers hallucination rates by 35.7% on AMBER and 24.5% on Object HalBench.
Quick Take
ViPSy also improves results on general visual grounding benchmarks and yields gains in semantic segmentation and ImageNet linear probing, which suggests the benefit extends beyond hallucination mitigation to the model's visual capabilities more broadly.
Key Points
- ViPSy constructs preference data using visual cues from semantically aligned image variants.
- The framework reduces hallucination rates by 35.7% on AMBER and 24.5% on Object HalBench relative to the previous state-of-the-art method.
- The resulting model also improves on the MMStar, MMVP, and CV-Bench benchmarks.
- ViPSy keeps candidate responses close to the policy's response distribution while making better use of visual information.
- Code for ViPSy is publicly available for further research and development.
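The source does not state which preference-alignment objective is applied to the ViPSy-constructed pairs. Purely as an illustration of how a chosen/rejected pair can drive training, the sketch below computes the Direct Preference Optimization (DPO) loss for a single pair. DPO is a swapped-in, commonly used objective here, not a confirmed part of ViPSy; the function name `dpo_loss`, the `beta` value, and the log-probabilities are hypothetical.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities
    under the policy (logp_*) and under a frozen reference model (ref_logp_*)."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities, for illustration only.
loss_tied = dpo_loss(-12.0, -12.0, -12.0, -12.0)    # policy identical to reference
loss_better = dpo_loss(-10.0, -14.0, -12.0, -12.0)  # policy favors the chosen response
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, which is why the quality of the constructed pairs matters so much.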
Paper Resources
Abstract: Vision-Language Models (VLMs) have shown strong performance in visual understanding, yet they still suffer from hallucinations, generating content that is not grounded in the image. Preference alignment is a promising approach to improve visual faithfulness, but its success depends heavily on how preference pairs are constructed. Existing methods exhibit two key limitations: (a) intervention-based methods often introduce significant deviation from the policy distribution, and (b) sampling-based methods often underuse visual information during the construction. In this paper, we propose ViPSy (Vision-driven Preference Synthesis), a framework for constructing preference data that are both policy-aligned and visually grounded. Our framework consists of two stages. In the first stage, ViPSy derives a visual cue from recurring object-level content across semantically aligned image variants, so preference construction can rely on visual information rather than language priors. In the second stage, ViPSy conditions the policy's own rollouts on this cue, allowing candidates to be guided by visually grounded content while staying close to the policy's response distribution. The resulting candidates remain close to the policy's response distribution while better leveraging visual information from the image. Experiments show that the resulting VLM, preference-aligned with ViPSy-constructed preference pairs, achieves a new state-of-the-art in hallucination mitigation. Compared with the previous state-of-the-art method, it reduces hallucination rates on AMBER and Object HalBench by 35.7% and 24.5%, respectively. The resulting model further improves on general visual grounding benchmarks, e.g., MMStar, MMVP, and CV-Bench, while also yielding gains in semantic segmentation and ImageNet linear probing, underscoring the effectiveness of our framework in enhancing the model's visual capabilities.
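The first stage described in the abstract can be illustrated with a minimal sketch. It assumes each semantically aligned image variant has already been reduced to a list of object labels (the abstract does not say how variants or object-level content are obtained), and it treats "recurring" as a simple vote threshold. The names `recurring_objects` and `build_cued_prompt`, the `min_fraction` parameter, and the prompt wording are all hypothetical and are not taken from the authors' code.

```python
from collections import Counter

def recurring_objects(variant_labels, min_fraction=0.5):
    """Return labels that appear in at least min_fraction of the image variants."""
    if not variant_labels:
        return []
    counts = Counter()
    for labels in variant_labels:
        counts.update(set(labels))  # count each label at most once per variant
    threshold = min_fraction * len(variant_labels)
    return sorted(label for label, n in counts.items() if n >= threshold)

def build_cued_prompt(question, cue_objects):
    """Hypothetical template for conditioning the policy's rollouts on the cue."""
    if not cue_objects:
        return question
    return f"{question}\nObjects visible in the image: {', '.join(cue_objects)}."

# Made-up object labels for three variants of the same image.
variants = [
    ["dog", "frisbee", "grass"],
    ["dog", "grass", "tree"],
    ["dog", "frisbee", "grass", "bench"],
]
cue = recurring_objects(variants, min_fraction=0.6)
prompt = build_cued_prompt("Describe the image.", cue)
```

Labels seen in only one variant (`tree`, `bench`) are dropped, so the cue keeps content that is stable across variants. In the second stage, the policy itself would generate candidates from a cue-conditioned prompt such as `prompt`, which is how candidates can stay close to the policy's own response distribution.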
| Comments: | 29 pages; Code is available at this https URL |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) |
| Cite as: | arXiv:2606.28401 [cs.CV] (or arXiv:2606.28401v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.28401 (arXiv-issued DOI via DataCite) |
Submission history
From: Yunhun Nam
[v1] Wed, 24 Jun 2026 11:06:22 UTC (7,771 KB)
— Originally published at arxiv.org