OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs
Quick Answer
OmniMem is a memory-efficient streaming framework for audio-visual LLMs that improves long-video accuracy by 2-4% absolute over strong training-free compression baselines under the same memory budget.
Quick Take
OmniMem combines a modality-aware memory allocation strategy, perturbation-aware memory selection, and budget-aware fine-tuning, and reports gains on VideoMME Long, LVBench, and LVOmniBench. The design addresses the token imbalance between the visual and audio streams and preserves informative KV states. It is evaluated with video-SALMONN 2+ and Qwen-2.5-Omni.
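The idea of giving each modality its own memory budget can be sketched as follows. This is a minimal illustration, not OmniMem's implementation: the summary does not say how budgets are set or how entries are scored, so `split_budget`, `select_memory`, the `visual_share` knob, and the scores in the example are all hypothetical.

```python
from collections import defaultdict


def split_budget(total_slots, visual_share):
    """Split a total KV budget into per-modality slot counts.

    visual_share is a hypothetical tuning knob in [0, 1], not a value
    taken from the paper.
    """
    visual = round(total_slots * visual_share)
    return {"visual": visual, "audio": total_slots - visual}


def select_memory(entries, budgets):
    """Keep the highest-scoring entries of each modality within its own budget.

    entries: list of (position, modality, score) tuples. The score is a
    placeholder for whatever importance measure is used; here the caller
    supplies it. Returns the kept positions in temporal order.
    """
    by_modality = defaultdict(list)
    for position, modality, score in entries:
        by_modality[modality].append((score, position))

    kept = []
    for modality, scored in by_modality.items():
        scored.sort(reverse=True)  # highest score first
        kept.extend(pos for _, pos in scored[: budgets.get(modality, 0)])
    return sorted(kept)


# Made-up scores: every visual entry outranks every audio entry.
entries = [
    (0, "visual", 0.9), (1, "visual", 0.8), (2, "visual", 0.7), (3, "visual", 0.6),
    (4, "audio", 0.3), (5, "audio", 0.5),
]
budgets = split_budget(total_slots=4, visual_share=0.75)
kept = select_memory(entries, budgets)
print(budgets, kept)
```

In this toy input a single shared top-4 ranking would keep only visual entries, because all four visual scores exceed both audio scores. Ranking within each modality guarantees that the audio stream retains at least its own slot.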
Key Points
- OmniMem improves long-video inference accuracy by 2-4% over strong training-free baselines.
- Introduces modality-aware memory allocation to manage visual and audio contexts separately.
- Employs perturbation-aware memory selection to preserve informative KV states.
- Budget-aware fine-tuning consolidates useful information into retained memory, adding a further 1-2% accuracy gain.
- Demonstrated effectiveness on benchmarks like VideoMME Long, LVBench, and LVOmniBench.
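The summary does not describe how perturbation-aware selection scores KV states. As a generic stand-in, and explicitly not OmniMem's rule, the sketch below uses leave-one-out perturbation: each memory entry is scored by how much a single-query attention output changes when that entry is removed. The function names and the toy keys and values are assumptions made for illustration.

```python
import math


def attention_output(query, keys, values):
    """Single-query softmax attention over scalar-valued memory entries."""
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))


def leave_one_out_scores(query, keys, values):
    """Score each entry by how far the output moves when it is removed.

    Requires at least two entries so the perturbed memory is never empty.
    """
    full = attention_output(query, keys, values)
    scores = []
    for i in range(len(keys)):
        rest_keys = keys[:i] + keys[i + 1:]
        rest_values = values[:i] + values[i + 1:]
        scores.append(abs(full - attention_output(query, rest_keys, rest_values)))
    return scores


# Toy memory: entries 0 and 1 are identical, entry 2 carries a different value.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [1.0, 1.0, 5.0]
scores = leave_one_out_scores(query, keys, values)
print(scores)
```

Under this stand-in score, the entry whose value differs from the others ranks above either of the two identical entries, which is the intuition behind keeping informative, non-redundant states. Leave-one-out costs one extra attention pass per entry, so a practical system would need a cheaper approximation.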
Article Content
From source RSS / original summary (arXiv:2606.07577v1). Abstract: Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamentally limited by the linear growth of video tokens and key-value (KV) caches. We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs.
Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding.
To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.
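The linear growth of the KV cache that motivates the abstract can be made concrete with a short calculation. In a standard transformer decoder, each token stores one key and one value vector per layer and per KV head, so the cache size is proportional to the token count. The model shape and the token rate below are hypothetical round numbers, not figures for video-SALMONN 2+ or Qwen-2.5-Omni.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor of 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape and token rate, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
TOKENS_PER_SECOND = 100

for minutes in (1, 10, 60):
    tokens = minutes * 60 * TOKENS_PER_SECOND
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{minutes:>3} min -> {tokens:>7} tokens, {gib:.2f} GiB of KV cache")
```

Because the cost per token is constant, a video ten times longer needs ten times the cache. A fixed memory budget therefore forces some form of selection or compression as the stream grows.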