From Hallucination to Grounding: Diagnosing Visual Spatial Intelligence via CRISP
Quick Answer
CRISP introduces a novel evaluation paradigm for visual spatial intelligence, revealing a disconnect between perception and reasoning in proprietary and open-source models.
Quick Take
CRISP evaluates visual spatial intelligence by checking consistency: whether what a model implicitly perceives agrees with what it explicitly reasons. The diagnosis exposes a perception-reasoning disconnect in both proprietary and open-source models. Proprietary models show strong latent reasoning but estimate metric quantities inaccurately, whereas open-source models are limited by weak multi-hop compositional reasoning. The framework shifts the focus from guessing correctly via language priors to perceiving, verifying, and reasoning.
Key Points
- CRISP utilizes metric 3D Scene Graphs for detailed evaluation of visual spatial intelligence.
- Proprietary models show robust latent reasoning but suffer from inaccurate metric estimation and fail to leverage their implicit structural representations.
- Open-source models are limited by their inability to perform multi-hop compositional reasoning.
- The framework emphasizes genuine perception and reasoning over guessing correctly from language priors.
- Code and dataset for CRISP are available on GitHub.
Paper Resources
Abstract: Current VLM evaluations often conflate language priors with genuine spatial reasoning. To address this, we introduce CRISP, a novel structural-diagnostic evaluation paradigm that assesses visual spatial intelligence through consistency, the alignment between implicit perception and explicit reasoning. Unlike traditional black-box QA, CRISP utilizes metric 3D Scene Graphs and an oracle intervention protocol to decouple latent reasoning capabilities from perceptual bottlenecks. This granular diagnosis uncovers a systematic perception-reasoning disconnect. Crucially, we reveal that while proprietary models possess robust latent reasoning engines, they suffer from inaccurate metric estimation and a critical failure to leverage their implicit structural representations. Conversely, open-source models remain fundamentally bottlenecked by their lack of multi-hop compositional reasoning. By shifting the focus from merely "guessing correctly" via language priors to genuinely "perceiving, verifying, and reasoning," CRISP offers a rigorous roadmap for multimodal alignment beyond end-to-end post-training. The code and dataset are available at this https URL.
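The abstract does not spell out how consistency or oracle intervention are computed, so the following is only a minimal sketch of the general idea and not the CRISP implementation. Every name here (`Node`, `distance`, `is_consistent`, `farther_object`), the toy scene, and the 10% tolerance are hypothetical choices made for illustration. A metric scene graph is reduced to named objects with 3D positions in metres. The consistency check compares a model's explicit distance answer with the distance implied by its own perceived graph, and the oracle step reruns a two-hop query on the ground-truth graph to see whether perception was the bottleneck.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Node:
    """One object in a toy metric scene graph (hypothetical structure)."""
    name: str
    position: Tuple[float, float, float]  # metres, arbitrary scene frame


Graph = Dict[str, Node]


def distance(graph: Graph, a: str, b: str) -> float:
    """Euclidean distance in metres between two named objects."""
    return math.dist(graph[a].position, graph[b].position)


def is_consistent(perceived: Graph, answered: float, a: str, b: str,
                  rel_tol: float = 0.1) -> bool:
    """True if the explicit answer matches the distance implied by the
    model's own perceived graph (the tolerance is an assumed value)."""
    return math.isclose(answered, distance(perceived, a, b), rel_tol=rel_tol)


def farther_object(graph: Graph, anchor: str, a: str, b: str) -> str:
    """Two-hop query: which of a and b lies farther from the anchor."""
    return a if distance(graph, anchor, a) > distance(graph, anchor, b) else b


# Ground-truth scene versus a perceived scene with a misplaced table.
gt: Graph = {
    "chair": Node("chair", (0.0, 0.0, 0.0)),
    "table": Node("table", (3.0, 4.0, 0.0)),
    "lamp": Node("lamp", (1.0, 0.0, 0.0)),
}
perceived: Graph = dict(gt, table=Node("table", (0.5, 0.0, 0.0)))

# The model answers "5.0 m", which is correct but not grounded in what it perceived.
grounded = is_consistent(perceived, 5.0, "chair", "table")

# Oracle intervention: swap in the ground-truth graph and rerun the query.
before = farther_object(perceived, "chair", "table", "lamp")
after = farther_object(gt, "chair", "table", "lamp")
```

In this toy case the answer is right while `grounded` is false, which is the kind of guess-versus-perception gap the abstract describes. The two-hop query changes once ground-truth positions are supplied, pointing to perception rather than reasoning as the failure in this example.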
| Comments: | Accepted to ECCV 2026 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2606.26535 [cs.CV] (or arXiv:2606.26535v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.26535 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Zhixing Li
[v1] Thu, 25 Jun 2026 02:18:38 UTC (7,867 KB)
— Originally published at arxiv.org