Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

arXiv cs.CV·Yigui Feng (The College of Computer Science, National University of Defense Technology, Changsha, Hunan, China), Qinglin Wang (The College of Computer Science, National University of Defense Technology, Changsha, Hunan, China), Yang Liu (The Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, Guangdong, China), Jie Liu (The College of Computer Science, National University of Defense Technology, Changsha, Hunan, China)

Quick Take

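Fre-Res compresses video tokens by keeping sparse high-fidelity spatial anchors and encoding dense temporal change as compact residual-frequency tokens, addressing the tension between spatial fidelity and temporal coverage.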

Key Points

Introduces budget-adaptive dual-track video-token compression.
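Applies a temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where the authors observe strong low-frequency concentration.
Matches or approaches full-token accuracy while substantially reducing visual-token length.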

[Submitted on 10 May 2026]

Authors: Yigui Feng (1), Qinglin Wang (1), Yang Liu (2), Jie Liu (1) ((1) The College of Computer Science, National University of Defense Technology, Changsha, Hunan, China, (2) The Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, Guangdong, China)

Abstract: Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose Fre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy-efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.
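
The residual-frequency idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes "inter-frame residuals" means differences between consecutive frame latents, uses the first frame as the only spatial anchor, and omits the Spatial-Guided Absorber entirely. The names `dct_matrix`, `compress_residuals`, and `reconstruct` are hypothetical and chosen only for illustration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis


def compress_residuals(latents: np.ndarray, keep: int):
    """Split (T, N, D) frame latents into one anchor and low-frequency residuals.

    Assumption: residuals are consecutive-frame differences, and the temporal
    DCT is applied independently to each token position and channel.
    """
    anchor = latents[0]                    # (N, D) spatial anchor
    residuals = np.diff(latents, axis=0)   # (T-1, N, D) inter-frame residuals
    basis = dct_matrix(residuals.shape[0])
    coeffs = np.tensordot(basis, residuals, axes=(1, 0))  # DCT along time
    return anchor, coeffs[:keep]           # keep only the lowest frequencies


def reconstruct(anchor: np.ndarray, coeffs: np.ndarray, num_frames: int) -> np.ndarray:
    """Approximate the (T, N, D) latents from the anchor and kept coefficients."""
    n = num_frames - 1
    padded = np.zeros((n,) + coeffs.shape[1:])
    padded[: coeffs.shape[0]] = coeffs     # dropped frequencies are treated as zero
    residuals = np.tensordot(dct_matrix(n).T, padded, axes=(1, 0))  # inverse DCT
    later_frames = anchor[None] + np.cumsum(residuals, axis=0)
    return np.concatenate([anchor[None], later_frames], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic latents: 16 frames, 8 tokens, 4 channels, slow drift plus small noise.
    drift = rng.normal(size=(8, 4))
    frames = np.arange(16)[:, None, None] * drift[None]
    frames = frames + 0.01 * rng.normal(size=(16, 8, 4))
    anchor, coeffs = compress_residuals(frames, keep=3)
    approx = reconstruct(anchor, coeffs, num_frames=16)
    print("relative error:", np.linalg.norm(approx - frames) / np.linalg.norm(frames))
```

In this toy setting, a smaller `keep` trades reconstruction accuracy for fewer temporal coefficients per token position. When the residual trajectory is smooth, most of its energy sits in the first few DCT coefficients, which is the kind of low-frequency concentration the abstract reports for vision-latent space.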

Comments:	24 pages, 5 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
ACM classes:	I.2.10
Cite as:	arXiv:2605.16366 [cs.CV]
	(or arXiv:2605.16366v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.16366 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yigui Feng
[v1] Sun, 10 May 2026 03:06:11 UTC (2,387 KB)

Originally published at arxiv.org
