Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models
Quick Take
Modern vision-language models (VLMs) often give stable predictions that are nonetheless wrong, which challenges the assumption that consistent outputs indicate geometric understanding. The newly introduced ViewDiag protocol evaluates VLMs along three axes and finds that predictions are only weakly coupled to viewpoint-specific evidence, suggesting prior-driven collapse rather than evidence-sensitive reasoning.
Key Points
- Leading VLMs produce consistent yet incorrect answers on metric distance queries.
- ViewDiag evaluates models on metric accuracy, distributional concentration, and latent feature probing.
- High prediction stability is observed alongside significant errors across diverse models.
- Results indicate that cross-view consistency does not equate to geometric understanding.
- ViewDiag serves as a controlled benchmark and diagnostic framework for assessing spatial VLMs beyond accuracy alone.
Article Content
From source RSS / original summary: arXiv:2606.02742v1 (Announce Type: new). Abstract: Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding.
We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce ViewDiag, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2 to 10 views per track.
The protocol evaluates models along three axes: metric accuracy, distributional concentration, and a latent feature probe for internal collapse that distinguishes decision collapse from representation collapse. Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy. These results challenge the common use of cross-view consistency as a proxy for geometric understanding.
Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating spatial VLMs beyond accuracy alone. The code and data can be found at https://github.com/SDivakarBhat/Consistent_Yet_Wrong.git
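To make the "strong consistency but low accuracy" regime concrete, here is a minimal Python sketch that scores one multi-view track on two quantities: how far the predicted distances are from ground truth, and how much they vary across views. It is an illustration only, not the ViewDiag implementation. The function names, the use of mean absolute relative error and coefficient of variation, and the two thresholds are all assumptions chosen for this example; the paper's actual metrics may differ.

```python
from statistics import mean, pstdev


def track_diagnostics(pred_m, true_m):
    """Return (mean absolute relative error, coefficient of variation) for one track.

    pred_m: predicted distances in metres, one per view (hypothetical input format).
    true_m: ground-truth distances in metres, one per view.
    Both metrics are illustrative stand-ins, not the paper's definitions.
    """
    if not pred_m or len(pred_m) != len(true_m):
        raise ValueError("need one prediction and one ground-truth value per view")
    # Accuracy axis: average relative deviation from ground truth.
    rel_err = mean(abs(p - t) / t for p, t in zip(pred_m, true_m))
    # Stability axis: spread of the predictions relative to their mean.
    spread = pstdev(pred_m) / mean(pred_m)
    return rel_err, spread


def is_consistent_yet_wrong(pred_m, true_m, max_spread=0.05, min_err=0.25):
    """Flag tracks whose predictions barely move across views but are far off.

    max_spread and min_err are arbitrary thresholds picked for illustration.
    """
    rel_err, spread = track_diagnostics(pred_m, true_m)
    return spread <= max_spread and rel_err >= min_err


# A track where the model answers about 2 m from every view, but the truth is 3.5 m.
stable_wrong = is_consistent_yet_wrong([2.0, 2.0, 2.1, 2.0], [3.5, 3.5, 3.5, 3.5])
# A track where the predictions are both stable and close to the truth.
stable_right = is_consistent_yet_wrong([3.4, 3.6, 3.5, 3.5], [3.5, 3.5, 3.5, 3.5])
```

In this sketch the first track is flagged and the second is not, even though both show similarly low spread across views. That is the point the abstract makes: stability alone cannot separate a grounded model from one that repeats a prior.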