Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos

arXiv cs.CV·Bonan Ding, Umair Nawaz, Ufaq Khan, Abdelrahman M. Shaker, Muhammad Haris Khan, Jiale Cao, Jin Xie, Fahad Shahbaz Khan

2h ago

·~2 min·5/27/2026·en·0

Quick Take

UniMVU enhances multimodal video understanding through instruction-aware gating, improving performance across diverse modalities.

Key Points

Dynamic gating reduces modality interference in video processing.
Achieves up to 13.5 gains in CIDEr metric across benchmarks.
Combines cross-modal attention with adaptive gating mechanisms.

Article Content

From source RSS / original summary

arXiv:2605. 26232v1 Announce Type: new Abstract: Pre-trained video large language models excel at visual reasoning. However, they struggle when videos arrive with auxiliary streams, such as audio, depth map, or dense temporal evidence. In such a scenario, uniform fusion induces modality interference, allowing irrelevant channels to distract the model.

To address this issue, we present a unified multimodal video understanding framework, named UniMVU, that performs instruction-aware fusion across video, audio, depth map, or any other modality inputs via two levels of dynamic gating: inner-modality gates emphasize salient regions within each modality, whereas modality-level gates re-weight whole streams; both are conditioned on the text instruction to adaptively balance modality importance.

Our UniMVU combines cross-modal self-attention with instruction-driven inner-modality gating module and a modality-level gating module with control token; for time-aligned streams we further adopt a fast-to-slow fusion scheme that reduces redundancy. Across six benchmarks (AVQA, AVSD, Music-AVQA, ScanQA, SQA3D and MVBench), our UniMVU achieves consistent gains over static-fusion baselines achieving gains as high as 13. 5 in terms of CIDEr metric.

Further, our analysis shows that the gating mechanism aligns with the human-interpretable modality relevance, and ablations show the contributions of inner-modality and modality-level gating. Our UniMVU provides a simple, unified recipe for instruction-aware multimodal video understanding that scales to diverse modalities without hand-crafted fusion rules.

Reader Mode unavailable (could not extract clean content).

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos

Quick Take

Key Points

Article Content

Want this in your inbox every morning?

More from arXiv cs.CV

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

Deep Learning-Based Automated Quantification of TIMI Myocardial Perfusion Frame Count (DL-TMPFC) from Coronary Angiography: A Novel Framework for Rapid Assessment of Microvascular Dysfunction

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning