Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs
Quick Take
Fre-Res compresses video tokens for MLLMs by keeping sparse high-fidelity spatial anchors and encoding dense temporal change as compact residual-frequency tokens.
Key Points
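- Introduces a budget-adaptive dual-track video-token compression framework that separates spatial and temporal evidence.
- Applies a temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where the authors observe strong low-frequency concentration.
- Matches or approaches full-token accuracy on short-video and long-video benchmarks while substantially reducing visual-token length.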
Authors: Yigui Feng (1), Qinglin Wang (1), Yang Liu (2), Jie Liu (1)
Affiliations: (1) The College of Computer Science, National University of Defense Technology, Changsha, Hunan, China; (2) The Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, Guangdong, China
Abstract: Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose Fre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy-efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.
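The temporal-frequency step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes latents shaped `(T, N, D)` (frames, tokens per frame, channels), treats the first frame as the only spatial anchor, takes consecutive-frame differences as the "inter-frame residuals", and keeps the first `keep` DCT coefficients. The function names `dct_matrix`, `compress_residuals`, and `reconstruct` are hypothetical, and the Spatial-Guided Absorber is not modeled.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (n, n); row k is the k-th frequency."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis


def compress_residuals(latents: np.ndarray, keep: int):
    """Split latents (T, N, D) into an anchor frame and low-frequency residual coefficients.

    Assumption: the anchor is frame 0 and residuals are consecutive-frame differences.
    Returns (anchor of shape (N, D), coeffs of shape (keep, N, D)).
    """
    anchor = latents[0]
    residuals = np.diff(latents, axis=0)             # (T-1, N, D)
    basis = dct_matrix(residuals.shape[0])
    # Temporal 1D-DCT: contract the basis with the time axis of the residuals.
    coeffs = np.tensordot(basis, residuals, axes=(1, 0))
    return anchor, coeffs[:keep]                     # drop high frequencies


def reconstruct(anchor: np.ndarray, coeffs: np.ndarray, num_frames: int) -> np.ndarray:
    """Invert the truncated DCT and re-accumulate residuals onto the anchor."""
    basis = dct_matrix(num_frames - 1)
    keep = coeffs.shape[0]
    # The basis is orthonormal, so its transpose is the inverse transform.
    residuals = np.tensordot(basis[:keep].T, coeffs, axes=(1, 0))
    later_frames = anchor[None] + np.cumsum(residuals, axis=0)
    return np.concatenate([anchor[None], later_frames], axis=0)
```

Under these assumptions, keeping all `T-1` coefficients reconstructs the latents exactly, and a trajectory that drifts linearly over time is recovered from a single (DC) coefficient per token channel, which is the kind of low-frequency concentration the abstract refers to.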
| Comments: | 24 pages, 5 figures |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| ACM classes: | I.2.10 |
| Cite as: | arXiv:2605.16366 [cs.CV] (or arXiv:2605.16366v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.16366 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Yigui Feng
[v1] Sun, 10 May 2026 03:06:11 UTC (2,387 KB)
Originally published at arxiv.org