Segmenting, Fast and Slow: Real-Time Open-Vocabulary Video Instance Segmentation with Dual-Path Processing
Quick Answer
SegFS is a dual-stream fast-slow framework for real-time open-vocabulary video instance segmentation (OV-VIS). Its fast branch achieves up to 14x lower latency than the mobile-oriented MOBIUS model while maintaining competitive performance on standard OV-VIS benchmarks.
Quick Take
By shifting instance propagation from object decoding to feature-space conditioning, SegFS decouples multimodal semantic understanding from dense mask prediction and enables efficient temporal propagation, with real-time inference on mobile devices as the target.
Key Points
- SegFS uses a dual-stream fast-slow framework for improved efficiency.
- The fast branch achieves up to 14x lower latency than the mobile-oriented MOBIUS model.
- Maintains competitive segmentation performance on OV-VIS benchmarks.
- Decouples multimodal semantic understanding from dense mask prediction.
- Enables efficient temporal propagation for real-time applications.
Article Content
From source RSS / original summary
arXiv:2607.00124v1 Announce Type: new
Abstract: Object-centric models inspired by DETR have become the dominant paradigm for open-vocabulary video instance segmentation (OV-VIS). While recent efforts have reduced the computational cost of pixel decoding, textual modality fusion, and object decoding to make these architectures more suitable for mobile devices, real-time on-device inference at high frame rates remains an open challenge.
In this paper, we introduce SegFS, a dual-stream fast-slow framework that significantly improves efficiency without sacrificing accuracy. On sparse keyframes, an open-vocabulary object-based model predicts instance-level representations. These representations are then projected back into the backbone feature space to condition a lightweight fast network, which efficiently relocalizes and segments the instances in subsequent frames.
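The keyframe and fast-frame split described in the abstract can be sketched as a simple scheduling loop. This is an illustrative sketch, not the authors' implementation: the fixed `keyframe_interval`, the function names, and the toy stand-in models are all assumptions, since the abstract only says that keyframes are sparse and that the slow model's instance representations condition the fast network.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class FrameResult:
    frame_index: int
    path: str   # "slow" on keyframes, "fast" on all other frames
    masks: Any


def run_fast_slow(
    frames: Sequence[Any],
    slow_model: Callable[[Any], Tuple[Any, Any]],
    fast_model: Callable[[Any, Any], Any],
    keyframe_interval: int,
) -> List[FrameResult]:
    """Run the slow model on every keyframe_interval-th frame (an assumed
    schedule) and the fast model, conditioned on the latest keyframe
    output, on every frame in between."""
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")
    results: List[FrameResult] = []
    conditioning = None
    for index, frame in enumerate(frames):
        if index % keyframe_interval == 0:
            # Slow path: returns conditioning for later frames plus masks.
            conditioning, masks = slow_model(frame)
            path = "slow"
        else:
            # Fast path: reuses the conditioning from the latest keyframe.
            masks = fast_model(frame, conditioning)
            path = "fast"
        results.append(FrameResult(index, path, masks))
    return results


# Hypothetical stand-ins for the two networks, used only to show data flow.
def toy_slow_model(frame: str) -> Tuple[dict, str]:
    conditioning = {"source_frame": frame}
    return conditioning, f"masks({frame})"


def toy_fast_model(frame: str, conditioning: dict) -> str:
    return f"masks({frame}|cond={conditioning['source_frame']})"


demo = run_fast_slow(
    ["f0", "f1", "f2", "f3", "f4"],
    toy_slow_model,
    toy_fast_model,
    keyframe_interval=3,
)
print([result.path for result in demo])
```

In this sketch the conditioning is an opaque object handed from the slow path to the fast path. In SegFS it would correspond to the instance-level representations projected into the backbone feature space, a step the abstract does not detail.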
By shifting instance propagation from object decoding to feature-space conditioning, our approach decouples multimodal semantic understanding from dense mask prediction and enables efficient temporal propagation. The proposed fast branch achieves up to 14x lower latency than the mobile-oriented MOBIUS model, while maintaining competitive segmentation performance on standard OV-VIS benchmarks.
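To see why sparse keyframes matter for throughput, the following back-of-the-envelope sketch computes the average per-frame latency of a fast-slow schedule. It assumes a fixed keyframe interval and strictly sequential execution, neither of which the abstract confirms, and the millisecond values are hypothetical placeholders rather than measurements from the paper.

```python
def amortized_latency_ms(slow_ms: float, fast_ms: float, keyframe_interval: int) -> float:
    """Average per-frame latency when one frame in every keyframe_interval
    frames takes the slow path and the rest take the fast path.
    Assumes the two paths run one after another, never concurrently."""
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")
    fast_frames = keyframe_interval - 1
    return (slow_ms + fast_frames * fast_ms) / keyframe_interval


# Hypothetical timings, chosen only to illustrate the arithmetic.
example = amortized_latency_ms(slow_ms=100.0, fast_ms=10.0, keyframe_interval=5)
print(example)
```

Under these assumptions the average approaches the fast-path latency as the keyframe interval grows, which is why the fast branch dominates the real-time budget.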