DIMOS: Disentangling Instance-level Moving Object Segmentation
Quick Answer
DIMOS introduces a dual-disentangling feature extraction framework for moving instance segmentation (MIS) that enhances performance, particularly for small instances in challenging conditions.
Quick Take
By separating appearance and motion information within both the event and image modalities, DIMOS reports state-of-the-art results in multimodal MIS and outperforms existing methods in fast-motion and low-light scenarios.
Key Points
- Introduces a dual-disentangling framework for enhanced feature extraction in MIS.
- Achieves state-of-the-art performance in multimodal MIS, especially for small instances.
- Addresses challenges of sparse features from event cameras in low-resolution settings.
- Implements multi-granularity cross-modal alignment for effective feature fusion.
- Demonstrates superior results in fast motion and low-light conditions.
Article Content
From source RSS / original summary
arXiv:2606.12826v1 Announce Type: new
Abstract: Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS.
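To make the event-camera description concrete, here is a minimal sketch of one common way to turn an asynchronous event stream into a dense frame: summing polarities per pixel over a time window. The abstract does not say which event representation DIMOS uses, so the event tuple layout, the function name `accumulate_events`, and the window parameters are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical event layout: (x, y, timestamp_seconds, polarity),
# where polarity is +1 for a brightness increase and -1 for a decrease.
Event = Tuple[int, int, float, int]


def accumulate_events(
    events: List[Event], width: int, height: int, t_start: float, t_end: float
) -> List[List[int]]:
    """Sum event polarities per pixel over the half-open window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for x, y, t, polarity in events:
        in_window = t_start <= t < t_end
        in_bounds = 0 <= x < width and 0 <= y < height
        if in_window and in_bounds:
            frame[y][x] += polarity
    return frame


events = [
    (0, 0, 0.001, 1),
    (0, 0, 0.002, 1),
    (1, 0, 0.003, -1),
    (1, 1, 0.050, 1),  # falls outside the 10 ms window below
]
frame = accumulate_events(events, width=2, height=2, t_start=0.0, t_end=0.010)
```

Most pixels in such a frame stay at zero, especially at low sensor resolution, which gives a rough picture of the sparsity problem the abstract goes on to describe for small moving instances.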
However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion.
To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details.
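The abstract does not describe how the cross-modal alignment is implemented. As a rough illustration of what aligning feature distributions across modalities can mean, the sketch below uses simple moment matching: event features are rescaled so their mean and standard deviation match those of the image features before the two are averaged. This is a stand-in technique, not the paper's method, and the names `match_moments` and `fuse` as well as the toy feature vectors are hypothetical.

```python
import math
from typing import List, Tuple


def mean_std(values: List[float]) -> Tuple[float, float]:
    """Return the mean and population standard deviation of a feature vector."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def match_moments(src: List[float], ref: List[float], eps: float = 1e-8) -> List[float]:
    """Rescale src so its mean and standard deviation match those of ref."""
    src_mean, src_std = mean_std(src)
    ref_mean, ref_std = mean_std(ref)
    return [(v - src_mean) / (src_std + eps) * ref_std + ref_mean for v in src]


def fuse(a: List[float], b: List[float]) -> List[float]:
    """Fuse two aligned feature vectors by element-wise averaging."""
    return [(x + y) / 2 for x, y in zip(a, b)]


# Toy features (assumed values): a dense image feature and a sparse event feature.
image_feat = [0.2, 0.4, 0.6, 0.8]
event_feat = [0.0, 0.0, 5.0, 0.0]

aligned_event = match_moments(event_feat, image_feat)
fused = fuse(image_feat, aligned_event)
```

Matching only the first two moments is the coarsest form of distribution alignment. The semantic alignment and multiple granularities mentioned in the abstract would need additional mechanisms that are not specified there.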
Experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.