MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

arXiv cs.CV·Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

6/9/2026

Quick Take

MOSS-Video-Preview introduces a two-channel, cross-attention architecture for real-time video understanding. On a single H200 with 256 frames per video, it reports about a 5x speedup in time to first token and 2.7x higher decoding throughput. It trails the Qwen2.5-VL-7B baseline overall, but it acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence.

Key Points

MOSS-Video-Preview targets real-time interaction by running perception and generation on separate, non-blocking pathways.
Achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput on a single H200 with 256 frames per video.
Uses a cross-attention backbone, which the authors argue suits real-time vision-language fusion better than decoder-only designs.
Specializes an offline model on synthesized real-time QA data to elicit behaviors such as answer revision and timely silence.
Remains competitive on offline video and multimodal understanding; the authors attribute the overall gap to Qwen2.5-VL-7B primarily to data and scale rather than the architecture.

Article Content

From source RSS / original summary

arXiv:2606.07639v1. Abstract: Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm.

Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture.
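One way to picture the claim that perception must not be blocked by generation is a toy loop in which frames keep arriving while a reply is being decoded. This is a minimal sketch, not the authors' implementation: `VisualMemory`, `run_two_channel`, and the one-frame-per-tick schedule are hypothetical names and simplifications introduced here purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VisualMemory:
    """Hypothetical side-channel store that the perception path writes to."""
    frames: list = field(default_factory=list)

    def write(self, frame_id: int) -> None:
        self.frames.append(frame_id)

    def latest(self) -> int:
        # -1 means no frame has been perceived yet.
        return self.frames[-1] if self.frames else -1


def run_two_channel(num_ticks: int, reply_len: int, reply_start: int):
    """Simulate one frame arriving per tick while a reply is decoded.

    Returns (memory, trace). For each generated token, trace records
    (token_index, newest_frame_visible_when_the_token_was_produced).
    """
    memory = VisualMemory()
    trace = []
    for tick in range(num_ticks):
        # Perception channel: ingests the new frame even in the middle of a reply.
        memory.write(tick)
        # Generation channel: emits one token per tick once the reply has started.
        if reply_start <= tick < reply_start + reply_len:
            trace.append((tick - reply_start, memory.latest()))
    return memory, trace


memory, trace = run_two_channel(num_ticks=6, reply_len=3, reply_start=2)
print(len(memory.frames))  # all 6 frames were perceived
print(trace)               # each token saw a newer frame than the previous one
```

In a blocking design, the frames arriving during ticks 2 to 4 would wait until the reply finished. In this toy version every token is produced against the newest frame, which is the property the two-channel argument is after.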

We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression.
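The side-channel idea can be illustrated with a few lines of single-head cross-attention in plain Python. This is a hedged sketch under strong assumptions: the vectors are toy values, there are no learned projections, and `cross_attend` is a hypothetical helper rather than anything from the paper's code. It only shows that new visual features can be appended to a separate memory without lengthening the text sequence.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, visual_keys, visual_values):
    """Scaled dot-product cross-attention for a single text query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in visual_keys])
    dim = len(visual_values[0])
    return [sum(w * v[i] for w, v in zip(weights, visual_values)) for i in range(dim)]


# The autoregressive sequence holds text tokens only.
text_sequence = ["<bos>", "what", "is", "happening"]
query = [1.0, 0.0]  # toy hidden state at the current text position

# Visual features live in a separate memory (the side channel).
visual_keys = [[1.0, 0.0], [0.0, 1.0]]
visual_values = [[1.0, 0.0], [0.0, 1.0]]
before = cross_attend(query, visual_keys, visual_values)

# A new frame arrives: only the visual memory grows.
visual_keys.append([4.0, 0.0])
visual_values.append([0.0, 5.0])
after = cross_attend(query, visual_keys, visual_values)

print(len(text_sequence))  # still 4: no visual tokens joined the text sequence
print(before, after)       # the attended visual summary changes with the new frame
```

In a decoder-only design the new frame's tokens would instead be appended to the same sequence the text tokens occupy, so perception and generation would compete for one autoregressive stream.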

We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior.

Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability.

Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.
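The data synthesis step, in which dense captions become QA whose answers track what has been perceived so far, can be sketched roughly as follows. The abstract does not describe the pipeline's internals, so everything here is an assumption: the `Caption` record, the rule that a segment counts as perceived once it has ended, and the use of plain concatenation as a stand-in for whatever answer-writing step the authors actually use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Caption:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str


def answer_at(captions: List[Caption], t: float) -> Optional[str]:
    """Build an answer from segments that have fully ended by time t.

    Returns None when nothing has been perceived yet.
    """
    seen = [c.text for c in captions if c.end <= t]
    return " ".join(seen) if seen else None


def synthesize(captions: List[Caption], question: str, query_times: List[float]):
    """Emit one sample per query time, labeled answer, revise, or silent."""
    samples = []
    previous = None
    for t in sorted(query_times):
        answer = answer_at(captions, t)
        if answer is None or answer == previous:
            action = "silent"   # nothing new to say at this time
        elif previous is None:
            action = "answer"   # first evidence is available
        else:
            action = "revise"   # new evidence changes the earlier answer
        samples.append({
            "time": t,
            "question": question,
            "action": action,
            "answer": None if action == "silent" else answer,
        })
        if answer is not None:
            previous = answer
    return samples


captions = [
    Caption(0.0, 2.0, "A man picks up a cup."),
    Caption(2.0, 5.0, "He pours coffee into it."),
]
samples = synthesize(captions, "What is the man doing?", [1.0, 2.0, 3.0, 5.0])
for s in samples:
    print(s["time"], s["action"], s["answer"])
```

The three labels mirror the behaviors named in the abstract: answering once evidence exists, revising when new evidence arrives, and staying silent otherwise.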
