From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
Quick Answer
This study traces the information flow in Audio-Visual Large Language Models (AVLLMs) such as Qwen2.5-Omni and Video-SALMONN2 Plus. It finds a sequential pathway for audio-visual video and parallel streams for multiple interleaved audio-visual items.
Quick Take
Beyond mapping these pathways, the study finds that audio-visual and other token types can be discarded once their information has been transferred to the LLM. Predictions change minimally or even improve slightly, which enables more efficient inference.
Key Points
- For audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs.
- Audio and visual contributions flow along this pathway in proportion to the task's reliance on each modality.
- With multiple interleaved audio-visual items, routing shifts to different parallel streams.
- Discarding audio-visual and other token types after their information reaches the LLM has minimal impact on predictions.
- Findings hold for Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales.
- The study lays groundwork for future interpretability, design, and efficiency work in MLLMs.
Article Content
arXiv:2606.10147v1. Abstract: Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood.
In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items.
We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams.
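The abstract does not say how the flow is measured. One technique commonly used for this kind of analysis is attention knockout: block attention from one group of tokens to another and observe how the prediction changes. The following is a minimal sketch of the masking step only, assuming a causal decoder and a hypothetical per-position list of token types. The function name and the token layout are illustrative, not taken from the paper.

```python
def knockout_mask(token_types, source_type, target_type):
    """Return a causal attention mask with source->target edges removed.

    mask[q][k] is True when query position q may attend to key position k.
    """
    n = len(token_types)
    # Causal baseline: each position attends to itself and earlier positions.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    for q in range(n):
        if token_types[q] != target_type:
            continue
        for k in range(q):
            if token_types[k] == source_type:
                mask[q][k] = False  # block the edge being probed
    return mask


# Hypothetical layout: two visual tokens, two audio tokens, two question tokens.
token_types = ["visual", "visual", "audio", "audio", "text", "text"]
mask = knockout_mask(token_types, source_type="audio", target_type="text")

# List the (query, key) pairs that the knockout removed from the causal mask.
blocked = [(q, k) for q in range(len(token_types)) for k in range(q + 1)
           if not mask[q][k]]
print(blocked)
```

In an actual experiment such a mask would be applied within a window of layers, and the drop in the probability of the correct answer would indicate how much the blocked edges matter at that depth.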
Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to the LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge.
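To make the efficiency argument concrete, here is a rough sketch that counts how many causal attention pairs are saved when some token types are dropped from a given layer onward. The cutoff layer, the layer count, the token layout, and the helper names are all assumptions for illustration. The abstract does not specify where the tokens are discarded.

```python
def positions_kept(token_types, layer, cutoff_layer, drop_types):
    """Indices of tokens still processed at `layer`."""
    if layer < cutoff_layer:
        return list(range(len(token_types)))
    return [i for i, t in enumerate(token_types) if t not in drop_types]


def attention_pairs(n_tokens):
    """Number of (query, key) pairs under causal attention."""
    return n_tokens * (n_tokens + 1) // 2


# Hypothetical input: 4 visual, 4 audio, and 2 text tokens.
token_types = ["visual"] * 4 + ["audio"] * 4 + ["text"] * 2
num_layers, cutoff = 4, 2  # illustrative values, not from the paper

full = num_layers * attention_pairs(len(token_types))
pruned = sum(
    attention_pairs(
        len(positions_kept(token_types, layer, cutoff, {"visual", "audio"}))
    )
    for layer in range(num_layers)
)
print(full, pruned)
```

The count covers attention pairs only. A real implementation would also need to drop the corresponding key-value cache entries and confirm that accuracy is preserved on the target task.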
Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.