UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning

arXiv cs.CV·Gexin Huang, Yanting Yang, Myeongkyun Kang, Beidi Zhao, Jun Zhou, Chen Zhou, Gang Wang, Zu-hua Gao, Xiaoxiao Li

6/5/2026

Quick Answer

UltraVR introduces a benchmark for evaluating vision-language models (VLMs) on ultra-resolution images, revealing significant shortcomings in evidence-grounded reasoning.

Quick Take

Current models struggle with tasks like fine-grained object grounding and spatial comparisons, indicating a need for improved visual evidence integration. This benchmark allows for detailed diagnostics of model failures, particularly in evidence grounding and local perception.

Key Points

UltraVR covers four scenarios: CCTV, remote sensing, pathology, and anomaly detection.
Structured annotations in UltraVR enable detailed process-level diagnostics of reasoning failures.
Current VLMs show unreliable performance on ultra-resolution reasoning tasks.
Errors are concentrated in the evidence grounding and local perception stages.
Downstream inference often improves when intermediate visual facts are provided.
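
The process-level diagnosis described in the points above can be illustrated with a small sketch. This is not the benchmark's actual schema or scoring code: the `Record` type, the stage ordering, and the `final_inference` stage name are assumptions made for illustration, and only the "evidence grounding" and "local perception" stage names come from the summary. Each response is attributed to the earliest stage judged incorrect, and the attributions are then tallied.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical stage order. Only the first two names appear in the summary;
# "final_inference" is an assumed placeholder for the downstream answer step.
STAGES = ("evidence_grounding", "local_perception", "final_inference")


@dataclass
class Record:
    """One model response with a per-stage pass/fail judgment (assumed schema)."""
    question_id: str
    stage_ok: dict  # stage name -> bool


def first_failed_stage(record):
    """Return the earliest stage marked incorrect, or None if every stage passed.

    A stage missing from `stage_ok` is treated as failed.
    """
    for stage in STAGES:
        if not record.stage_ok.get(stage, False):
            return stage
    return None


def failure_breakdown(records):
    """Count how many responses first fail at each stage."""
    failed = (first_failed_stage(r) for r in records)
    return Counter(stage for stage in failed if stage is not None)


# Made-up example judgments, purely to exercise the functions.
records = [
    Record("q1", {"evidence_grounding": True, "local_perception": True, "final_inference": True}),
    Record("q2", {"evidence_grounding": False, "local_perception": False, "final_inference": False}),
    Record("q3", {"evidence_grounding": True, "local_perception": False, "final_inference": False}),
    Record("q4", {"evidence_grounding": True, "local_perception": True, "final_inference": False}),
    Record("q5", {"evidence_grounding": False, "local_perception": True, "final_inference": True}),
]

breakdown = failure_breakdown(records)
```

Attributing each failure to the first incorrect stage is one simple design choice; it avoids double-counting downstream errors that may only be consequences of an earlier miss.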

Source Excerpt

Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images - where critical evidence is tiny, subtle, spatially distant, or distributed - remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-re
