From Hallucination to Grounding: Diagnosing Visual Spatial Intelligence via CRISP

arXiv cs.CV·Zhixing Li, Yinan Yu

~2 min read · 6/26/2026 · en

Quick Take

CRISP is an evaluation paradigm for visual spatial intelligence that exposes a disconnect between perception and reasoning in both proprietary and open-source vision-language models (VLMs). Proprietary models show robust latent reasoning but estimate metric quantities inaccurately and fail to use their own implicit structural representations. Open-source models are instead bottlenecked by a lack of multi-hop compositional reasoning. The framework shifts evaluation from guessing correctly via language priors to perceiving, verifying, and reasoning.

Key Points

CRISP uses metric 3D Scene Graphs and an oracle intervention protocol to separate latent reasoning ability from perceptual bottlenecks.
Proprietary models have robust latent reasoning but estimate metric quantities inaccurately.
Open-source models are bottlenecked by a lack of multi-hop compositional reasoning.
The framework rewards perceiving, verifying, and reasoning rather than guessing correctly from language priors.
Code and dataset for CRISP are available on GitHub.
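
To make the terms "metric 3D Scene Graph" and "multi-hop compositional reasoning" concrete, here is a minimal Python sketch. It is an illustration only, not the CRISP schema or code: the `SceneObject` class, the meter-based coordinate convention, the helper names, and the toy scene are all assumptions introduced here. It answers a two-hop question ("how far is the object nearest the chair from the lamp?") by first resolving the intermediate object and then measuring a metric distance.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SceneObject:
    """One node of a toy metric scene graph (hypothetical schema)."""
    name: str
    center: Tuple[float, float, float]  # (x, y, z) in meters, assumed convention


def distance(a: SceneObject, b: SceneObject) -> float:
    """Euclidean distance between two object centers, in meters."""
    return math.dist(a.center, b.center)


def nearest_to(target: SceneObject, objects: List[SceneObject]) -> SceneObject:
    """Hop 1: return the object closest to `target`, excluding `target` itself."""
    others = [o for o in objects if o.name != target.name]
    return min(others, key=lambda o: distance(target, o))


def two_hop_distance(start: str, end: str, objects: List[SceneObject]):
    """Resolve 'the object nearest to `start`', then measure its distance to `end`."""
    by_name = {o.name: o for o in objects}
    intermediate = nearest_to(by_name[start], objects)      # hop 1: relational lookup
    return intermediate.name, distance(intermediate, by_name[end])  # hop 2: metric query


# Toy scene, invented for illustration.
scene = [
    SceneObject("chair", (0.0, 0.0, 0.0)),
    SceneObject("table", (1.0, 0.0, 0.0)),
    SceneObject("lamp", (1.0, 2.0, 0.0)),
    SceneObject("sofa", (4.0, 4.0, 0.0)),
]

if __name__ == "__main__":
    name, meters = two_hop_distance("chair", "lamp", scene)
    print(f"nearest to chair: {name}; distance to lamp: {meters:.2f} m")
```

A model that answers such a question correctly without being able to name the intermediate object or estimate the distance would be a candidate for the language-prior guessing the key points describe.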

[Submitted on 25 Jun 2026]

Abstract: Current VLM evaluations often conflate language priors with genuine spatial reasoning. To address this, we introduce CRISP, a novel structural-diagnostic evaluation paradigm that assesses visual spatial intelligence through consistency, the alignment between implicit perception and explicit reasoning. Unlike traditional black-box QA, CRISP utilizes metric 3D Scene Graphs and an oracle intervention protocol to decouple latent reasoning capabilities from perceptual bottlenecks. This granular diagnosis uncovers a systematic perception-reasoning disconnect. Crucially, we reveal that while proprietary models possess robust latent reasoning engines, they suffer from inaccurate metric estimation and a critical failure to leverage their implicit structural representations. Conversely, open-source models remain fundamentally bottlenecked by their lack of multi-hop compositional reasoning. By shifting the focus from merely "guessing correctly" via language priors to genuinely "perceiving, verifying, and reasoning," CRISP offers a rigorous roadmap for multimodal alignment beyond end-to-end post-training. The code and dataset are available at this https URL.
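
The abstract does not spell out how the oracle intervention protocol is scored, so the following Python sketch shows only one plausible way such a diagnosis could be tabulated. Everything in it is an assumption made for illustration: the `Record` fields, the metric names, the proxy definition of consistency (answer correctness agreeing with perception correctness), and the four toy records. None of it is taken from the CRISP code or dataset.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """Per-question outcomes (hypothetical fields, not the CRISP format)."""
    perceived_ok: bool      # model's own scene estimate matched ground truth
    answer_ok: bool         # end-to-end answer was correct
    oracle_answer_ok: bool  # answer was correct when given the ground-truth scene graph


def diagnose(records: List[Record]) -> Dict[str, float]:
    """Summarize where failures come from under the assumed definitions."""
    n = len(records)
    end_to_end = sum(r.answer_ok for r in records) / n
    oracle = sum(r.oracle_answer_ok for r in records) / n
    return {
        "end_to_end_acc": end_to_end,
        # Accuracy with perfect perception supplied: a proxy for latent reasoning.
        "oracle_acc": oracle,
        # Accuracy recovered by the oracle: a proxy for the perceptual bottleneck.
        "perception_gap": oracle - end_to_end,
        # Proxy consistency: answer correctness agrees with perception correctness.
        "consistency": sum(r.perceived_ok == r.answer_ok for r in records) / n,
        # Correct answer despite wrong perception: candidate language-prior guess.
        "lucky_guess_rate": sum((not r.perceived_ok) and r.answer_ok for r in records) / n,
    }


# Toy records, invented for illustration.
records = [
    Record(perceived_ok=True, answer_ok=True, oracle_answer_ok=True),     # grounded success
    Record(perceived_ok=False, answer_ok=True, oracle_answer_ok=True),    # likely guess
    Record(perceived_ok=False, answer_ok=False, oracle_answer_ok=True),   # perception bottleneck
    Record(perceived_ok=True, answer_ok=False, oracle_answer_ok=False),   # reasoning failure
]

if __name__ == "__main__":
    for key, value in diagnose(records).items():
        print(f"{key}: {value:.2f}")
```

Under these assumed definitions, a large `perception_gap` with high `oracle_acc` would match the pattern the abstract reports for proprietary models, while low `oracle_acc` would match the reasoning bottleneck it reports for open-source models.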

Comments:	Accepted to ECCV 2026
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.26535 [cs.CV]
	(or arXiv:2606.26535v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2606.26535 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zhixing Li
[v1] Thu, 25 Jun 2026 02:18:38 UTC (7,867 KB)

Originally published at arxiv.org