HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

arXiv cs.CV·Mengqi Shi, Haopeng Zhang



Quick Take

HAVEN introduces a comprehensive benchmark for unified video understanding with hierarchical multimodal alignment.

Key Points

Addresses gaps in existing multimodal evaluation benchmarks.
Offers a unified dataset architecture for video and text.
Provides a rigorous testbed for future multimodal research.


[Submitted on 19 May 2026]


Abstract: While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.
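
The abstract does not specify the annotation schema, so the following is only a minimal sketch of what a frame/shot/video hierarchy with aligned text could look like. Every name here (`FrameNote`, `ShotNote`, `VideoNote`, the `saliency` field, and both helper functions) is a hypothetical illustration, not HAVEN's released format or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameNote:
    index: int        # frame index within the video
    caption: str      # text aligned to this frame
    saliency: float   # hypothetical importance score in [0, 1]


@dataclass
class ShotNote:
    start: int        # first frame index of the shot (inclusive)
    end: int          # last frame index of the shot (inclusive)
    summary: str      # text aligned to this shot
    frames: List[FrameNote] = field(default_factory=list)


@dataclass
class VideoNote:
    video_id: str
    summary: str      # text aligned to the whole video
    shots: List[ShotNote] = field(default_factory=list)


def is_hierarchically_aligned(video: VideoNote) -> bool:
    """Check that shots are ordered and non-overlapping, and that every
    annotated frame falls inside the shot that owns it."""
    previous_end = -1
    for shot in video.shots:
        if shot.start <= previous_end or shot.end < shot.start:
            return False
        if any(not (shot.start <= f.index <= shot.end) for f in shot.frames):
            return False
        previous_end = shot.end
    return True


def rank_frames_by_saliency(video: VideoNote) -> List[int]:
    """Return annotated frame indices from most to least salient."""
    frames = [f for shot in video.shots for f in shot.frames]
    ranked = sorted(frames, key=lambda f: f.saliency, reverse=True)
    return [f.index for f in ranked]


# Made-up example data, for illustration only.
example = VideoNote(
    video_id="demo-001",
    summary="A chef prepares and plates a salad.",
    shots=[
        ShotNote(0, 9, "The chef chops vegetables.", [
            FrameNote(2, "Knife on cutting board.", 0.4),
            FrameNote(7, "Sliced tomatoes in close-up.", 0.9),
        ]),
        ShotNote(10, 19, "The salad is plated.", [
            FrameNote(12, "Bowl filled with greens.", 0.6),
        ]),
    ],
)

misaligned = VideoNote(
    video_id="demo-002",
    summary="A frame is attached to the wrong shot.",
    shots=[ShotNote(10, 19, "A single shot.", [FrameNote(25, "Stray frame.", 0.5)])],
)
```

Nesting frames inside shots and shots inside a video keeps the three granularities in one record, so a consistency check or a saliency ranking can walk a single structure instead of joining separate keyframe, key-shot, and summary files.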

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2605.19223 [cs.CV]
	(or arXiv:2605.19223v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.19223 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Haopeng Zhang
[v1] Tue, 19 May 2026 00:48:14 UTC (9,750 KB)

— Originally published at arxiv.org
