Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

arXiv cs.CV·Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman

·~2 min·5/20/2026·en·1

Quick Take

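This study explores whether competitive video models can be built by freezing a pre-trained image foundation model and training only a recurrent temporal module, so that video pre-training is spent on temporal reasoning alone.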

Key Points

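Video foundation models typically require large-scale pre-training on massive video datasets, at substantial data and compute cost.
A frozen image foundation model can serve as the spatial encoder, leaving only a recurrent temporal module to train.
Initial results across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training.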

[Submitted on 18 May 2026]

Abstract: Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: this https URL.
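To make the setup in the abstract concrete, here is a minimal sketch of the data flow it describes: a frozen per-frame encoder feeding a small recurrent module that processes frames one at a time. Everything in it is an illustrative assumption, not the authors' implementation. The abstract does not say which image model or which recurrent architecture is used, so the encoder here is a fixed random linear map standing in for a pre-trained image foundation model, the temporal module is a plain Elman-style recurrent cell, and all class and function names are hypothetical. No training step is shown.

```python
import math
import random


class FrozenFrameEncoder:
    """Toy stand-in (assumption) for a frozen image foundation model.

    Weights are stored in tuples to signal that they are never updated.
    """

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.weights = tuple(
            tuple(rng.uniform(-0.1, 0.1) for _ in range(in_dim))
            for _ in range(feat_dim)
        )

    def __call__(self, frame):
        # One spatial feature vector per frame; no temporal information here.
        return [sum(w * x for w, x in zip(row, frame)) for row in self.weights]


class RecurrentTemporalModule:
    """Toy Elman-style recurrent cell (assumption): the only trainable part."""

    def __init__(self, feat_dim, state_dim, seed=1):
        rng = random.Random(seed)
        self.state_dim = state_dim
        self.w_in = [[rng.uniform(-0.1, 0.1) for _ in range(feat_dim)]
                     for _ in range(state_dim)]
        self.w_rec = [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
                      for _ in range(state_dim)]

    def initial_state(self):
        return [0.0] * self.state_dim

    def step(self, state, feature):
        # h_t = tanh(W_in z_t + W_rec h_{t-1})
        return [
            math.tanh(
                sum(a * f for a, f in zip(row_in, feature))
                + sum(b * s for b, s in zip(row_rec, state))
            )
            for row_in, row_rec in zip(self.w_in, self.w_rec)
        ]

    def num_parameters(self):
        return sum(len(r) for r in self.w_in) + sum(len(r) for r in self.w_rec)


def stream_video(frames, encoder, temporal):
    """Process frames one at a time, carrying only a fixed-size state."""
    state = temporal.initial_state()
    for frame in frames:
        feature = encoder(frame)                # frozen spatial encoding
        state = temporal.step(state, feature)   # trainable temporal update
        yield state


# Usage: five random "frames" of 12 values each (placeholder data).
encoder = FrozenFrameEncoder(in_dim=12, feat_dim=8)
temporal = RecurrentTemporalModule(feat_dim=8, state_dim=4)
data_rng = random.Random(42)
video = [[data_rng.random() for _ in range(12)] for _ in range(5)]
states = list(stream_video(video, encoder, temporal))
```

In this sketch only `w_in` and `w_rec` would receive updates during video pre-training, and the memory carried between frames stays fixed in size regardless of video length, which is the property that makes a recurrent design a natural fit for streaming input.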

Comments:	Accepted to CVPR 2026 Workshops CV4Smalls
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2605.19137 [cs.CV]
	(or arXiv:2605.19137v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.19137 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Svetlana Orlova
[v1] Mon, 18 May 2026 21:35:09 UTC (100 KB)

— Originally published at arxiv.org
