MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
Quick Answer
MOSS-Video-Preview introduces a two-channel architecture for real-time video understanding. On a single H200 with 256 frames per video, it reports about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability.
Quick Take
MOSS-Video-Preview trails the Qwen2.5-VL-7B baseline overall, a gap the authors attribute primarily to data and scale rather than to the architecture. It nevertheless acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. The authors present this as a viable path toward real-time video understanding.
Key Points
- MOSS-Video-Preview targets real-time interaction by keeping perception and generation on separate, non-blocking pathways.
- On a single H200 with 256 frames per video, it reports about a 5x speedup in time to first token and 2.7x higher decoding throughput.
- It uses a cross-attention backbone, which the authors argue suits real-time vision-language fusion better than the prevailing decoder-only design.
- An offline model is specialized on synthesized real-time QA data to elicit behaviors such as answer revision and timely silence.
- It attains competitive offline video and multimodal understanding while trailing Qwen2.5-VL-7B overall, a gap the authors attribute primarily to data and scale.
Article Content
From source RSS / original summary: arXiv:2606.07639v1 (announce type: new)
Abstract: Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm.
Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture.
We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression.
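The side-channel idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the single attention head, and the use of identity projections in place of learned weights are all assumptions made for brevity. The point it shows is that text states act as queries over a separate visual memory, so adding visual features leaves the number of autoregressive positions unchanged, whereas a decoder-only design appends visual tokens to the sequence.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_states, visual_memory):
    """Single-head cross-attention (hypothetical simplification).

    Assumption: identity query/key/value projections stand in for learned ones.
    Each text state is a query; visual features are the keys and values.
    Returns one fused vector per text state, whatever the memory size.
    """
    dim = len(text_states[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in text_states:
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in visual_memory
        ]
        weights = softmax(scores)
        fused.append(
            [sum(w * value[d] for w, value in zip(weights, visual_memory))
             for d in range(dim)]
        )
    return fused


def decoder_only_sequence_length(num_text_tokens, num_visual_tokens):
    """In a decoder-only design, visual tokens join the autoregressive sequence."""
    return num_text_tokens + num_visual_tokens


text_states = [[1.0, 0.0], [0.0, 1.0]]
visual_memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused = cross_attend(text_states, visual_memory)
# Growing the visual memory does not add text positions.
fused_more = cross_attend(text_states, visual_memory + [[0.5, 0.5]])
```

Here `len(fused)` and `len(fused_more)` both equal the number of text states, while `decoder_only_sequence_length(2, 3)` grows with every visual token that is added.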
We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior.
Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence.
On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.
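The abstract does not detail the data synthesis pipeline, so the following is only a rough sketch of the general idea: turning timestamped dense captions into QA samples whose answers reflect what has been seen so far. The `Caption` and `RealtimeQA` types, the `synthesize` function, and the rule of concatenating captions into an answer are all hypothetical. An answer of `None` stands for timely silence, used here when nothing new has been perceived since the last reply.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Caption:
    start: float  # second at which the described event becomes visible
    text: str


@dataclass
class RealtimeQA:
    time: float            # moment at which the model is probed
    question: str
    answer: Optional[str]  # None means "stay silent"


def synthesize(captions: List[Caption], question: str,
               query_times: List[float]) -> List[RealtimeQA]:
    """Build one QA sample per query time (hypothetical procedure).

    The answer at time t uses only captions with start <= t, so it is
    revised as new evidence appears. If the evidence is unchanged since
    the previous reply, the sample is marked silent.
    """
    ordered = sorted(captions, key=lambda c: c.start)
    samples = []
    last_answer = None
    for t in sorted(query_times):
        seen = [c.text for c in ordered if c.start <= t]
        answer = " ".join(seen) if seen else None
        if answer == last_answer:
            samples.append(RealtimeQA(t, question, None))  # nothing new to say
        else:
            samples.append(RealtimeQA(t, question, answer))  # revised answer
            last_answer = answer
    return samples


captions = [
    Caption(0.0, "A man enters the kitchen."),
    Caption(5.0, "He starts chopping onions."),
]
samples = synthesize(captions, "What is happening?", [1.0, 3.0, 6.0])
answers = [s.answer for s in samples]
```

With these inputs the sample at 1.0 s carries the first caption, the sample at 3.0 s is silent because nothing new has appeared, and the sample at 6.0 s carries a revised answer covering both captions.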
