Video2LoRA: Parametric Video Internalization for Vision-Language Models

arXiv cs.CV·Manan Suri, Sarvesh Baskar, Dinesh Manocha

6/4/2026

Quick Answer

Video2LoRA is a method for parametric video internalization in vision-language models: it converts a video into a LoRA adapter, so SmolVLM2 can answer queries about the video with zero visual tokens.

Quick Take

Video2LoRA reduces answer-time visual-token load by up to 1,500x and query time-to-first-token (TTFT) by 6-80x while maintaining performance across multiple benchmarks.

Key Points

Video2LoRA generates LoRA adapters directly from video in a single forward pass.
Matches the performance of direct video-in-context inference across five captioning benchmarks.
Reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x.
Performance remains stable up to 1,024 frames and 1024px resolution.
Supports independent adapter generation for non-overlapping video segments.
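
To make the token-load point concrete, here is a minimal sketch of how visual-token cost grows with repeated queries. It assumes the in-context baseline re-processes the full video for every query (no prefix caching) and that the internalized approach encodes the video once to produce the adapter. The function names and the numbers (64 frames, 100 tokens per frame) are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def in_context_visual_tokens(frames: int, tokens_per_frame: int, queries: int) -> int:
    """Visual tokens processed when the video is re-fed in context for every query."""
    return frames * tokens_per_frame * queries


def internalized_visual_tokens(frames: int, tokens_per_frame: int) -> int:
    """Visual tokens processed when the video is encoded once to build an adapter.

    Later queries carry zero visual tokens, so this cost does not grow
    with the number of queries.
    """
    return frames * tokens_per_frame


# Hypothetical values, for illustration only.
frames, tokens_per_frame = 64, 100
for queries in (1, 10, 100):
    baseline = in_context_visual_tokens(frames, tokens_per_frame, queries)
    one_time = internalized_visual_tokens(frames, tokens_per_frame)
    print(f"queries={queries:>3}  in-context={baseline:>7}  internalized={one_time:>5}")
```

Under these assumptions the baseline cost scales linearly with the number of queries, while the internalized cost is paid once per video.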

Source Excerpt

From the original publisher, up to about 700 characters

arXiv:2606.04351v1 Announce Type: new Abstract: Processing video in […] is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass.

Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Video2LoRA predicts these weights directly from the video. …
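
The contrast with gradient-based fine-tuning can be sketched in a few lines. The example below uses the standard LoRA update W' = W + (alpha / r) * B * A and a toy stand-in for the hypernetwork: a fixed random linear projection that maps a feature vector to the factors B and A in one pass, with no gradient steps. The function `toy_hypernetwork`, its parameters, and the feature vector are hypothetical; this is not the perceiver architecture described in the paper, only an illustration of predicting adapter weights directly.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def toy_hypernetwork(video_feature, d_out, d_in, rank, seed=0):
    """Predict LoRA factors B (d_out x rank) and A (rank x d_in) in one pass.

    Hypothetical stand-in: each entry is a fixed random linear projection of
    the feature vector. The paper's perceiver hypernetwork is not reproduced.
    """
    rng = random.Random(seed)

    def project(n_rows, n_cols):
        return [
            [sum(rng.uniform(-0.1, 0.1) * f for f in video_feature) for _ in range(n_cols)]
            for _ in range(n_rows)
        ]

    return project(d_out, rank), project(rank, d_in)


def apply_lora(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying the frozen weight W."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Frozen 2x3 weight and a made-up feature vector standing in for video features.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b, a = toy_hypernetwork([0.5, -1.0, 2.0, 0.25], d_out=2, d_in=3, rank=1)
w_adapted = apply_lora(w, b, a, alpha=1.0, rank=1)
print(w_adapted)
```

The point of the sketch is the control flow: the adapter comes from a single function call on the video features, whereas standard LoRA fine-tuning would reach B and A through many optimizer steps.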


