AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

arXiv cs.CV·Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

6/9/2026

Quick Answer

AVI-Bench introduces a comprehensive benchmark for evaluating Omni-Multimodal Large Language Models (Omni-MLLMs) across perception, understanding, and reasoning stages.

Quick Take

The benchmark exposes significant limitations in current models. An extension, AVI-Bench-PriSe, tests generalization using unfamiliar, low-semantic stimuli. The framework is intended to guide the development of more robust and generalizable audio-visual intelligence.

Key Points

AVI-Bench evaluates Omni-MLLMs through cross-modal tasks requiring joint audio-visual interpretation.
The benchmark identifies substantial limitations in existing Omni-MLLMs' audio-visual intelligence.
AVI-Bench-PriSe tests models' robustness using unfamiliar, low-semantic stimuli.
A four-level AVI taxonomy is proposed based on experimental findings.
The project aims to guide the development of more robust audio-visual intelligence.


Article Content

From source RSS / original summary

arXiv:2606.07643v1

Abstract: Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks.

We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes.
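As a rough illustration of what stage-wise diagnosis could look like in practice, the sketch below groups per-question outcomes by the three stages named in the abstract and reports accuracy for each. The record format, the function name `accuracy_by_stage`, and the toy data are assumptions made for this example; the abstract does not describe AVI-Bench's actual data format or scoring code.

```python
from collections import defaultdict

# Stage names come from the abstract; everything else here is hypothetical.
STAGES = ("perception", "understanding", "reasoning")


def accuracy_by_stage(records):
    """Return {stage: accuracy} from an iterable of (stage, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for stage, is_correct in records:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        totals[stage] += 1
        correct[stage] += int(is_correct)
    # Report only stages that have at least one record.
    return {s: correct[s] / totals[s] for s in STAGES if totals[s]}


# Made-up outcomes for illustration only; not results from the paper.
toy_records = [
    ("perception", True),
    ("perception", True),
    ("understanding", True),
    ("understanding", False),
    ("reasoning", False),
]
scores = accuracy_by_stage(toy_records)
for stage, acc in scores.items():
    print(f"{stage}: {acc:.2f}")
```

Reporting one score per stage, rather than a single aggregate, is what lets a reader see whether a model fails early (perception) or only at later stages (reasoning).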

To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy.

Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/

