DIMOS: Disentangling Instance-level Moving Object Segmentation

arXiv cs.CV·Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng

6/12/2026

Quick Answer

DIMOS is a dual-disentangling feature extraction framework for moving instance segmentation (MIS) that improves segmentation quality, particularly for small instances under challenging conditions such as fast motion and low light.

Quick Take

By separating appearance and motion information within both the event and image modalities, DIMOS achieves state-of-the-art results in multimodal MIS, especially on small instances in fast-motion and low-light scenes.

Key Points

Introduces a dual-disentangling framework for enhanced feature extraction in MIS.
Achieves state-of-the-art performance in multimodal MIS, especially for small instances.
Addresses challenges of sparse features from event cameras in low-resolution settings.
Implements multi-granularity cross-modal alignment for effective feature fusion.
Demonstrates superior results in fast motion and low-light conditions.

Article Content

From source RSS / original summary

arXiv:2606.12826v1

Abstract: Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS.
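To make the event-camera description concrete, here is a minimal sketch of one common way to turn an asynchronous event stream into a dense frame: summing polarities per pixel over a time window. The `(x, y, timestamp, polarity)` tuple layout and the function names are assumptions chosen for illustration; the abstract does not say which event representation DIMOS uses.

```python
from typing import List, Tuple

# Hypothetical event layout (an assumption, not from the paper):
# (x, y, timestamp_us, polarity) with polarity +1 (brighter) or -1 (darker).
Event = Tuple[int, int, int, int]


def accumulate_events(events: List[Event], width: int, height: int,
                      t_start: int, t_end: int) -> List[List[int]]:
    """Sum event polarities per pixel over the window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for x, y, t, polarity in events:
        in_window = t_start <= t < t_end
        in_bounds = 0 <= x < width and 0 <= y < height
        if in_window and in_bounds:
            frame[y][x] += polarity
    return frame


def active_pixel_ratio(frame: List[List[int]]) -> float:
    """Fraction of pixels with a nonzero net polarity (a rough density measure)."""
    total = sum(len(row) for row in frame)
    active = sum(1 for row in frame for value in row if value != 0)
    return active / total if total else 0.0


# Toy stream: two events on one pixel, one on another, one outside the window.
events = [(1, 0, 10, 1), (1, 0, 20, 1), (2, 1, 30, -1), (3, 2, 999, 1)]
frame = accumulate_events(events, width=4, height=3, t_start=0, t_end=100)
density = active_pixel_ratio(frame)
```

In this toy stream only 2 of 12 pixels carry any signal, which hints at why a small moving object on a low-resolution sensor can leave very few event features to work with.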

However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion.

To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details.
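The abstract does not spell out how the alignment is computed. Purely as a loose illustration of what "distributionally consistent" and "semantically consistent" could mean for two feature vectors, the sketch below standardizes each modality's features (matching mean and variance) and then compares them with cosine similarity. These are stand-in techniques chosen for the example, not the DIMOS method, and all names and values are hypothetical.

```python
import math
from typing import List


def standardize(vec: List[float]) -> List[float]:
    """Shift and scale a feature vector to zero mean and unit variance."""
    n = len(vec)
    mean = sum(vec) / n
    var = sum((v - mean) ** 2 for v in vec) / n
    std = math.sqrt(var) or 1.0  # avoid dividing by zero for constant vectors
    return [(v - mean) / std for v in vec]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical motion features from each modality, on very different scales.
image_motion = [0.2, 0.4, 0.6, 0.8]
event_motion = [20.0, 40.0, 60.0, 80.0]

# Stand-in for distribution alignment: bring both to the same mean and variance.
image_aligned = standardize(image_motion)
event_aligned = standardize(event_motion)

# Stand-in for a semantic consistency check: do the aligned features agree?
similarity = cosine_similarity(image_aligned, event_aligned)

# Naive fusion by averaging, only meaningful once the scales match.
fused = [(a + b) / 2 for a, b in zip(image_aligned, event_aligned)]
```

The point of the toy example is only that features on mismatched scales are hard to fuse directly; once standardized, the two vectors here become nearly identical and averaging them is well behaved.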

Experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.
