Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers
Quick Take
Tensor Memory adds a fixed-size recurrent 3D memory tensor to Transformer blocks, decoupling state capacity from input length. The lightweight module targets long-horizon video understanding and occlusion-sensitive reasoning, can be attached to or removed from existing blocks without other architectural changes, and is evaluated on standard language, image, and video benchmarks.
Key Points
- Tensor Memory uses a differentiable soft write to deposit content in a voxel grid.
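- The memory tensor has a constant size, so state capacity does not grow with input length.
- The module integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
- Evaluated on standard language, image, and video benchmarks, plus a controlled toy diagnostic suite that isolates when persistent state is beneficial.
- Targets long-horizon video understanding and occlusion-sensitive reasoning, which the authors describe as difficult for attention and KV caching alone.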
Article Content
Original abstract (arXiv:2605.27686v1):
Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult.
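To make the growth argument concrete, the short sketch below compares how many scalars a KV cache holds against a fixed voxel grid. All sizes (layers, heads, grid shape, channels) are hypothetical values chosen for illustration and are not taken from the paper; the function names are likewise made up for this example.

```python
def kv_cache_elements(num_tokens, num_layers, num_heads, head_dim):
    """Scalars stored by a KV cache: one key and one value vector
    per token, per layer, per head."""
    return 2 * num_tokens * num_layers * num_heads * head_dim


def voxel_memory_elements(depth, height, width, channels):
    """Scalars stored by a fixed (depth, height, width, channels) grid.
    The token count does not appear in this formula at all."""
    return depth * height * width * channels


if __name__ == "__main__":
    # Hypothetical model and grid sizes, for illustration only.
    for tokens in (1_000, 10_000, 100_000):
        kv = kv_cache_elements(tokens, num_layers=12, num_heads=12, head_dim=64)
        grid = voxel_memory_elements(16, 16, 16, 64)
        print(f"{tokens:>7} tokens: KV cache {kv:>13,} scalars, grid {grid:,} scalars")
```

The cache term scales linearly with the number of tokens, while the grid term stays the same however long the input runs.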
We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor. Tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location. The memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion.
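The write and read steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the Gaussian weights are normalized or which interpolation the "continuous sampling" uses, so the sketch assumes each token deposits unit total weight and that reads use trilinear interpolation. The names `soft_write`, `trilinear_read`, and `sigma` are hypothetical, and the local interaction operator and gating are left out.

```python
import numpy as np


def soft_write(memory, positions, contents, sigma=1.0):
    """Add each token's content to the grid as a Gaussian-weighted volume.

    memory:    (D, H, W, C) voxel grid
    positions: (N, 3) continuous (z, y, x) coordinates in voxel units
    contents:  (N, C) content vectors
    """
    D, H, W, _ = memory.shape
    zz, yy, xx = np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij")
    grid = np.stack([zz, yy, xx], axis=-1).astype(np.float64)  # (D, H, W, 3)
    # Squared distance from every voxel centre to every token position.
    diff = grid[None] - positions[:, None, None, None, :]  # (N, D, H, W, 3)
    sq_dist = (diff ** 2).sum(axis=-1)  # (N, D, H, W)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Assumption: normalize so each token deposits a total weight of 1.
    weights /= weights.sum(axis=(1, 2, 3), keepdims=True)
    return memory + np.einsum("ndhw,nc->dhwc", weights, contents)


def trilinear_read(memory, positions):
    """Sample the grid at continuous (z, y, x) positions (assumed trilinear)."""
    D, H, W, C = memory.shape
    upper = np.array([D - 1, H - 1, W - 1], dtype=np.float64)
    p = np.clip(positions, 0.0, upper)
    lo = np.floor(p).astype(int)
    hi = np.minimum(lo + 1, upper.astype(int))
    frac = p - lo  # fractional offset inside the cell, shape (N, 3)
    out = np.zeros((positions.shape[0], C))
    # Blend the eight corner voxels surrounding each position.
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                z = hi[:, 0] if dz else lo[:, 0]
                y = hi[:, 1] if dy else lo[:, 1]
                x = hi[:, 2] if dx else lo[:, 2]
                w = ((frac[:, 0] if dz else 1.0 - frac[:, 0])
                     * (frac[:, 1] if dy else 1.0 - frac[:, 1])
                     * (frac[:, 2] if dx else 1.0 - frac[:, 2]))
                out += w[:, None] * memory[z, y, x]
    return out
```

Both functions are built from smooth operations on the positions (exponentials and linear blends), which is what lets gradients reach a predicted 3D location in an autodiff framework; NumPy is used here only to show the arithmetic.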
Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
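A toy sketch of the constant-size property: the state below keeps the same shape however many tokens stream through it. The GRU-style convex blend is one common form of gated recurrence and is assumed here, since the abstract does not give the exact update rule. The write step is a deliberately crude stand-in (the chunk mean broadcast over the grid), and `gated_update`, `run_stream`, and `gate_logit` are hypothetical names.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_update(memory, candidate, gate_logit):
    """Blend old state and candidate; the gate decides how much is overwritten."""
    g = sigmoid(gate_logit)
    return (1.0 - g) * memory + g * candidate


def run_stream(token_chunks, grid_shape=(4, 4, 4), channels=8, gate_logit=0.0):
    """Fold a stream of token chunks into one fixed-size memory tensor."""
    memory = np.zeros(grid_shape + (channels,))
    for chunk in token_chunks:  # chunk: (num_tokens, channels), any length
        # Stand-in write (not the paper's soft write): chunk mean everywhere.
        candidate = np.broadcast_to(chunk.mean(axis=0), memory.shape)
        memory = gated_update(memory, candidate, gate_logit)
    return memory


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    short = [rng.normal(size=(10, 8)) for _ in range(2)]
    long = [rng.normal(size=(10_000, 8)) for _ in range(50)]
    print(run_stream(short).shape, run_stream(long).shape)
```

Both calls return a tensor of shape `(4, 4, 4, 8)`; only the contents differ, which is the sense in which capacity is decoupled from input length.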