OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation
Quick Take
OmniMem is an explicit, full-range memory retrieval framework for autoregressive video generation. It reportedly improves Dynamic Degree on long-video generation by 52.3% while keeping memory usage comparable to baselines. It targets two problems, local bias in sparse KV selection and Union Explosion in memory access, so that generation keeps explicit access to query-relevant historical details.
Key Points
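- OmniMem performs sparse KV retrieval over the full historical cache for long video generation.
- Adaptive Window Exclusion removes local-window blocks from the selection candidates once sufficient long-range history is available.
- Query-Shared KV Selection reduces cross-query diversity.
- Per-Head Scattered KV Access lets each attention head retrieve non-contiguous KV blocks according to its own selection pattern.
- Experiments report a 52.3% gain in Dynamic Degree with strong consistency relative to strong baselines and comparable memory usage.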
Article Content
From source RSS / original summary
arXiv:2605.30519v1 Announce Type: new
Abstract: Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details.
We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval.
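The candidate-filtering idea behind Adaptive Window Exclusion can be illustrated with a small sketch. This is not the authors' implementation: the function name `select_blocks`, the `min_history` threshold, and the use of a plain top-k over per-block scores are all assumptions made for illustration. It only shows how excluding the most recent blocks keeps the sparse budget for older history once enough of it exists.

```python
def select_blocks(scores, local_window, budget, min_history):
    """Pick up to `budget` block indices by score.

    Hypothetical sketch: `scores[i]` is a relevance score for KV block i
    (oldest first). The newest `local_window` blocks are assumed to be
    covered by a separate local attention window.
    """
    n = len(scores)
    history = max(n - local_window, 0)  # blocks older than the local window
    if history >= min_history:
        # Enough long-range history: drop local-window blocks from candidates.
        candidates = range(history)
    else:
        # Too little history: keep every block eligible.
        candidates = range(n)
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


scores = [0.1, 0.9, 0.2, 0.8, 0.95, 0.99]  # recent blocks score highest (local bias)
with_exclusion = select_blocks(scores, local_window=2, budget=2, min_history=2)
without_exclusion = select_blocks(scores, local_window=2, budget=2, min_history=5)
```

In this toy input the two newest blocks have the highest scores, so without exclusion they consume the whole budget. With exclusion, the budget goes to the best older blocks instead.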
Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer. This allows each attention head to retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that OmniMem improves Dynamic Degree by 52.3% and preserves strong consistency relative to strong baselines, while maintaining comparable memory usage.
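A rough sketch of how these two ideas might fit together follows. It is an illustration under stated assumptions, not the paper's method: mean pooling across queries is assumed as the way to share one selection per head, and the helper names `shared_selection`, `per_head_gather`, and `union_size` are hypothetical. The point is only that each head reads its own scattered blocks, so the retrieved set per head stays at the budget even when the union across heads is larger.

```python
def shared_selection(query_scores, budget):
    """One selection per head, shared by all queries in the chunk.

    query_scores[q][b] is the score of block b for query q (one head).
    Mean pooling across queries is an assumption of this sketch.
    """
    num_blocks = len(query_scores[0])
    pooled = [
        sum(q[b] for q in query_scores) / len(query_scores)
        for b in range(num_blocks)
    ]
    ranked = sorted(range(num_blocks), key=lambda b: pooled[b], reverse=True)
    return sorted(ranked[:budget])


def per_head_gather(kv_cache, selections):
    """Each head reads only its own, possibly non-contiguous, blocks."""
    return [[kv_cache[h][b] for b in sel] for h, sel in enumerate(selections)]


def union_size(selections):
    """Blocks a single shared buffer would need to hold for all heads."""
    return len(set().union(*selections))


# Toy data: 2 heads, 2 queries per head, 6 historical blocks.
head_scores = [
    [[0.9, 0.1, 0.1, 0.8, 0.0, 0.0], [0.7, 0.2, 0.1, 0.9, 0.0, 0.0]],
    [[0.0, 0.6, 0.0, 0.0, 0.9, 0.1], [0.1, 0.8, 0.0, 0.0, 0.7, 0.1]],
]
kv_cache = [[f"h{h}b{b}" for b in range(6)] for h in range(2)]

selections = [shared_selection(qs, budget=2) for qs in head_scores]
gathered = per_head_gather(kv_cache, selections)
buffer_blocks = union_size(selections)
```

Here each head gathers two blocks, while a buffer built from the union of both heads' selections would hold four. With many heads and little overlap between them, that gap is what the abstract calls Union Explosion.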