AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs
Quick Answer
AVI-Bench introduces a comprehensive benchmark for evaluating Omni-Multimodal Large Language Models (Omni-MLLMs) across perception, understanding, and reasoning stages.
Quick Take
The benchmark exposes significant limitations in current models, and an extension, AVI-Bench-PriSe, tests generalization using unfamiliar, low-semantic stimuli. The framework is intended to guide the development of more robust and generalizable audio-visual intelligence.
Key Points
- AVI-Bench evaluates Omni-MLLMs through cross-modal tasks requiring joint audio-visual interpretation.
- The benchmark identifies substantial limitations in existing Omni-MLLMs' audio-visual intelligence.
- AVI-Bench-PriSe tests models' robustness using unfamiliar, low-semantic stimuli.
- A four-level AVI taxonomy is proposed based on experimental findings.
- The project aims to guide the development of more robust audio-visual intelligence.
Article Content
Original summary (arXiv:2606.07643v1). Abstract: Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks.
We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes.
To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy.
Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/
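The stage-wise diagnosis described in the abstract can be illustrated with a minimal sketch. The three stage names come from the abstract; the record layout (`stage`, `prediction`, `answer`), the function name `accuracy_by_stage`, and the toy data are hypothetical and are not the benchmark's actual format or scoring code.

```python
from collections import defaultdict

# Stage names are taken from the abstract; everything else is an assumption.
STAGES = ("perception", "understanding", "reasoning")


def accuracy_by_stage(records):
    """Return {stage: accuracy} for records shaped like
    {"stage": str, "prediction": str, "answer": str} (hypothetical layout)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        stage = record["stage"]
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage!r}")
        total[stage] += 1
        correct[stage] += int(record["prediction"] == record["answer"])
    # Report only stages that have at least one record.
    return {s: correct[s] / total[s] for s in STAGES if total[s]}


# Toy records, invented purely for illustration.
sample = [
    {"stage": "perception", "prediction": "A", "answer": "A"},
    {"stage": "perception", "prediction": "B", "answer": "B"},
    {"stage": "understanding", "prediction": "C", "answer": "C"},
    {"stage": "understanding", "prediction": "A", "answer": "D"},
    {"stage": "reasoning", "prediction": "B", "answer": "A"},
]

scores = accuracy_by_stage(sample)
```

Reporting one score per stage, rather than a single aggregate, is what makes it possible to see whether a model fails early (perception) or only at later stages (reasoning).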