UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning
Quick Answer
UltraVR introduces a benchmark for evaluating vision-language models (VLMs) on ultra-resolution images, revealing significant shortcomings in evidence-grounded reasoning.
Quick Take
Current models struggle with tasks such as fine-grained object grounding and long-range spatial comparison, which points to weak acquisition and integration of visual evidence. Because each instance carries step-level annotations, the benchmark can show where a model fails, most often in evidence grounding and local perception.
Key Points
- UltraVR benchmarks VLMs across four scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD).
- Structured annotations in UltraVR enable detailed process-level diagnostics of reasoning failures.
- Current VLMs show unreliable performance on ultra-resolution reasoning tasks.
- Errors concentrate in the evidence grounding and local perception stages.
- Downstream inference often improves when intermediate visual facts are provided.
Article Content
From source RSS / original summary: arXiv:2606.05576v1. Abstract: Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images, where critical evidence is tiny, subtle, spatially distant, or distributed, remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence.
We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges: fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts.
Beyond standard QA triples, each instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis over black-box scoring. Using UltraVR, we evaluate frontier VLMs and show that current models remain far from reliable on ultra-resolution reasoning.
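To make the annotation structure concrete, here is a minimal Python sketch of how one such instance might be represented. The five reasoning labels come from the abstract; the class names, field names, and the example instance are hypothetical and are not taken from the UltraVR release.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ReasoningLabel(Enum):
    # The five stages named in the abstract.
    EVIDENCE_GROUNDING = "evidence grounding"
    LOCAL_PERCEPTION = "local perception"
    QUANTIFICATION = "quantification"
    EVIDENCE_INTEGRATION = "evidence integration"
    DECISION_INFERENCE = "decision inference"


@dataclass
class ReasoningStep:
    # One step of the ground-truth chain of thought (hypothetical field names).
    question: str
    answer: str
    label: ReasoningLabel


@dataclass
class Instance:
    # A QA item plus its step-level annotations (hypothetical schema).
    image_path: str
    scenario: str  # one of "CCTV", "RS", "WSI", "AD"
    question: str
    answer: str
    steps: List[ReasoningStep] = field(default_factory=list)


# Invented example for illustration only; not real benchmark data.
example = Instance(
    image_path="images/rs_example.tif",
    scenario="RS",
    question="Which of the two parking lots holds more vehicles?",
    answer="The northern lot",
    steps=[
        ReasoningStep("Where are the two parking lots?",
                      "One in the north, one in the south-east",
                      ReasoningLabel.EVIDENCE_GROUNDING),
        ReasoningStep("How many vehicles are in each lot?",
                      "North: 14, south-east: 9",
                      ReasoningLabel.QUANTIFICATION),
        ReasoningStep("Which lot holds more vehicles?",
                      "The northern lot",
                      ReasoningLabel.DECISION_INFERENCE),
    ],
)

# List the stage labels that this instance exercises, in step order.
stages_used = [step.label.value for step in example.steps]
print(stages_used)
```

Tagging each step with exactly one label is what would let an evaluator score every stage separately instead of scoring only the final answer.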
Importantly, the structured annotations allow us to localize failures across the visual-to-decision pipeline: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied. These findings demonstrate UltraVR as a diagnostic testbed for measuring not only whether VLMs answer correctly, but where their ultra-resolution reasoning process breaks.
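The kind of process-level diagnosis described here can be sketched as a per-stage accuracy tally. The code below is an illustrative assumption about how step-level results could be aggregated, not the authors' evaluation code; the function name `stage_accuracy` and the toy records are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Stage names as listed in the abstract, in pipeline order.
STAGES = [
    "evidence grounding",
    "local perception",
    "quantification",
    "evidence integration",
    "decision inference",
]


def stage_accuracy(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (stage, is_correct) pairs into accuracy per stage.

    Stages with no records are omitted from the result.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for stage, is_correct in records:
        totals[stage] += 1
        hits[stage] += int(is_correct)
    return {s: hits[s] / totals[s] for s in STAGES if totals[s] > 0}


# Toy step-level results, invented for illustration only.
toy_records = [
    ("evidence grounding", False),
    ("evidence grounding", True),
    ("local perception", False),
    ("local perception", True),
    ("decision inference", True),
    ("decision inference", True),
]

accuracy = stage_accuracy(toy_records)
# The weakest stage is where the pipeline most often breaks.
# On ties, min() returns the first such stage in pipeline order.
weakest = min(accuracy, key=accuracy.get)
print(accuracy, weakest)
```

A low score in the early stages alongside a high score in decision inference would match the pattern the abstract reports, where downstream inference often recovers once intermediate visual facts are supplied.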