Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

arXiv cs.CV·Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

5/28/2026


Quick Answer

Tensor Memory introduces a fixed-size recurrent 3D memory tensor to Transformers, enhancing long-horizon video understanding by decoupling state capacity from input length.

Quick Take

The module is lightweight: it can be attached to or removed from existing Transformer blocks without other architectural changes, and it works with standard Transformer training pipelines. The authors evaluate it on standard language, image, and video benchmarks, plus a controlled toy diagnostic suite designed to isolate when persistent state helps.

Key Points

Tensor Memory uses a differentiable soft write for voxel grid updates.
The memory tensor has a constant size, so state capacity does not grow with input length.
Integrates with standard Transformer training pipelines easily.
Evaluated on language, image, and video benchmarks.
Designed to enhance occlusion-sensitive reasoning capabilities.

Paper Resources


Article Content

From source RSS / original summary

arXiv:2605.27686v1. Abstract: Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult.

We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion.
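The write, update, and read steps described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 6-neighbour averaging used as the "local interaction operator", the scalar gates, and trilinear interpolation as the "continuous sampling" are all assumptions chosen for illustration. In the actual module the 3D locations and gates would presumably be predicted by learned layers.

```python
import numpy as np

def gaussian_weights(pos, grid_shape, sigma=1.0):
    """Normalized Gaussian weights over every voxel, centred at a continuous (z, y, x)."""
    axes = [np.arange(n, dtype=np.float64) for n in grid_shape]
    zz, yy, xx = np.meshgrid(*axes, indexing="ij")
    d2 = (zz - pos[0]) ** 2 + (yy - pos[1]) ** 2 + (xx - pos[2]) ** 2
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum()

def soft_write(grid_shape, positions, contents, sigma=1.0):
    """Deposit each token's content vector as a Gaussian-weighted volume."""
    deposit = np.zeros(tuple(grid_shape) + (contents.shape[1],))
    for pos, c in zip(positions, contents):
        deposit += gaussian_weights(pos, grid_shape, sigma)[..., None] * c
    return deposit

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_mix(mem):
    """Assumed stand-in for the local operator: mean of a voxel and its 6 face neighbours."""
    p = np.pad(mem, ((1, 1), (1, 1), (1, 1), (0, 0)))  # zero padding at the borders
    total = (p[1:-1, 1:-1, 1:-1]
             + p[:-2, 1:-1, 1:-1] + p[2:, 1:-1, 1:-1]
             + p[1:-1, :-2, 1:-1] + p[1:-1, 2:, 1:-1]
             + p[1:-1, 1:-1, :-2] + p[1:-1, 1:-1, 2:])
    return total / 7.0

def gated_update(mem, deposit, gate_logit=0.0):
    """Gated recurrent step: blend old memory with a locally mixed candidate."""
    g = sigmoid(gate_logit)  # hypothetical scalar gate; a real gate would be learned
    candidate = local_mix(mem + deposit)
    return (1.0 - g) * mem + g * candidate

def trilinear_read(mem, pos):
    """Sample the memory at a continuous (z, y, x) by trilinear interpolation."""
    dims = np.array(mem.shape[:3])
    p = np.clip(np.asarray(pos, dtype=np.float64), 0.0, dims - 1.0)
    lo = np.minimum(np.floor(p).astype(int), np.maximum(dims - 2, 0))
    frac = p - lo
    out = np.zeros(mem.shape[-1])
    for corner in np.ndindex(2, 2, 2):
        offs = np.array(corner)
        w = np.prod(np.where(offs == 1, frac, 1.0 - frac))
        idx = np.minimum(lo + offs, dims - 1)
        out += w * mem[idx[0], idx[1], idx[2]]
    return out

def read_and_fuse(token, mem, pos, gate_logit=0.0):
    """Gated residual fusion: token plus a gated copy of the sampled context."""
    return token + sigmoid(gate_logit) * trilinear_read(mem, pos)
```

Because the Gaussian weights are normalized, each write distributes exactly one copy of the token's content across the grid, and the memory keeps its shape no matter how many tokens are written.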

Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
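To make the constant-size claim concrete, the sketch below counts stored elements for a KV cache versus a single memory tensor. The layer count, model width, grid shape, and channel count are hypothetical values picked for illustration, not figures from the paper, and the count assumes one memory tensor rather than one per block.

```python
def kv_cache_elements(num_tokens, num_layers, d_model):
    """Keys plus values cached for every token in every layer."""
    return 2 * num_tokens * num_layers * d_model

def tensor_memory_elements(grid_shape, channels):
    """A D x H x W voxel grid with a fixed number of channels per voxel."""
    d, h, w = grid_shape
    return d * h * w * channels

# Hypothetical configuration: 12 layers, width 768, a 16^3 grid with 64 channels.
for tokens in (1_000, 10_000, 100_000):
    kv = kv_cache_elements(tokens, num_layers=12, d_model=768)
    tm = tensor_memory_elements((16, 16, 16), channels=64)
    print(f"{tokens:>7} tokens: KV cache {kv:>13,} elements, memory tensor {tm:,} elements")
```

The KV cache count scales linearly with the number of tokens, while the memory tensor count depends only on the chosen grid and channel sizes.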
