Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
Quick Take
A study of vision-language models (VLMs) finds that they often answer spatial questions overconfidently even when occlusion or perspective ambiguity makes the visual evidence incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. The authors argue that models should recognize when to abstain and identify which additional views would provide reliable evidence.
Key Points
- Existing spatial reasoning benchmarks typically assume that visual observations are sufficient and reliable.
- The SpatialUncertain framework introduces two observation challenges: occlusion and perspective ambiguity.
- Average VLM accuracy is around 30% under occlusion and below 10% under perspective ambiguity.
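- Even when additional views are available, some models perform near random chance at identifying which one would resolve the ambiguity.
- The findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.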
Article Content
From source RSS / original summary. arXiv:2605.30557v1. Abstract: Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading.
Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed.
In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges.
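The scoring rule implied by this design can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `Item` structure, the condition names, and the `ABSTAIN` label are all assumptions made for the example. The idea is that the same question counts as correct when answered under a clean observation, but only when the model abstains under occlusion or perspective ambiguity.

```python
from dataclasses import dataclass

# Hypothetical normalized label for "the question cannot be answered from this view".
ABSTAIN = "cannot be determined"


@dataclass
class Item:
    condition: str   # assumed names: "clean", "occlusion", or "perspective"
    gold: str        # ground-truth answer under a clean observation
    prediction: str  # model output, assumed already normalized


def is_correct(item: Item) -> bool:
    # Under a challenge condition the only correct response is to abstain;
    # under a clean observation the model must give the ground-truth answer.
    expected = item.gold if item.condition == "clean" else ABSTAIN
    return item.prediction == expected


def accuracy_by_condition(items: list[Item]) -> dict[str, float]:
    # Tally correct responses separately for each observation condition.
    totals: dict[str, int] = {}
    hits: dict[str, int] = {}
    for it in items:
        totals[it.condition] = totals.get(it.condition, 0) + 1
        hits[it.condition] = hits.get(it.condition, 0) + int(is_correct(it))
    return {cond: hits[cond] / totals[cond] for cond in totals}
```

Under this rule, a model that always produces the clean-view answer scores perfectly on clean items and zero on challenge items, which is the overconfident pattern the abstract describes.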
We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity.
Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
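To make "near random chance" concrete, the sketch below compares view-selection accuracy against a uniform-guessing baseline. It assumes each item offers a fixed set of candidate views of which exactly one resolves the ambiguity; the abstract does not state the number of candidates or the exact task format, so the function names and inputs here are hypothetical.

```python
def chance_accuracy(num_candidates_per_item: list[int]) -> float:
    # Expected accuracy of uniform random guessing, assuming exactly one
    # of the k candidate views for each item is the correct choice.
    return sum(1.0 / k for k in num_candidates_per_item) / len(num_candidates_per_item)


def view_selection_accuracy(chosen: list[int], correct: list[int]) -> float:
    # Fraction of items where the model picked the view index marked as correct.
    if len(chosen) != len(correct):
        raise ValueError("chosen and correct must have the same length")
    return sum(c == g for c, g in zip(chosen, correct)) / len(correct)
```

With four candidate views per item, for example, the baseline is 0.25, so a model scoring close to that value is effectively not using the visual evidence to choose a view.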