Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

arXiv cs.CV·Sieu Tran, Duc Nguyen, Hao Vo, Khoa Vo, Ngan Le

1d ago

~2 min read · 6/12/2026 · en

Quick Take

Dual-State Slot Attention (DSSA) is a self-supervised method for video object-centric learning that stores per-frame appearance and cross-frame identity in separate slot states. It targets two failure modes of earlier slot-based methods: slot swapping and weakly attending slots that absorb tokens from other objects. On MOVi-C, MOVi-D, and YouTube-VIS, the authors report better segmentation quality and temporal consistency than prior methods, along with stronger downstream object recognition and video dynamics prediction.

Key Points

DSSA splits each slot into a local state for per-frame appearance and an identity state for temporally stable object information.
Introduces competition-modulated aggregation to reduce updates from weakly matching slots.
Demonstrates improved segmentation quality on MOVi-C, MOVi-D, and YouTube-VIS.
Enhances downstream object recognition and video dynamics prediction.
Code and models will be publicly available upon acceptance.

Article Content

From source RSS / original summary

arXiv:2606.12601v1. Announce Type: new. Abstract: Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion.

First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence.
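
The renormalization issue can be illustrated with a small NumPy sketch of the two normalization steps in standard Slot Attention: a softmax over slots for each token, followed by a per-slot renormalization over tokens. The toy logits and the helper name `slot_attention_weights` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_weights(logits, eps=1e-8):
    """logits: (num_slots, num_tokens) slot-token similarity scores."""
    # Step 1: slots compete for each token (softmax over the slot axis).
    attn = softmax(logits, axis=0)
    # Step 2: each slot's attention is renormalized over tokens so that
    # its aggregation weights sum to (almost exactly) one.
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return attn, weights

# Toy example (made-up numbers): slot 0 matches every token strongly,
# slot 1 matches every token weakly.
logits = np.array([[5.0, 5.0, 5.0, 5.0],
                   [0.0, 0.0, 0.0, 1.0]])
attn, weights = slot_attention_weights(logits)

print("attention mass won by each slot:", attn.sum(axis=1))
print("aggregation weights per slot sum to:", weights.sum(axis=1))
```

In this toy case slot 1 wins only a few percent of the total attention mass, yet after renormalization its weights still sum to one, so it receives a full-strength update built from tokens it barely matches.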

We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations.

The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction.
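
The abstract does not give DSSA's equations, so the following NumPy code is only a rough sketch of the two ideas described above, not the authors' method. It makes two loud assumptions: the competition signal is taken to be each slot's peak post-softmax attention (a hypothetical choice), and the learned recurrent transition is replaced by a fixed exponential moving average. The names `cma_update`, `identity_filter`, and `alpha` are illustrative.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cma_update(local, tokens, eps=1e-8):
    """One hypothetical competition-modulated update of the local states.

    local:  (num_slots, dim) per-frame appearance states
    tokens: (num_tokens, dim) frame features
    """
    logits = local @ tokens.T / np.sqrt(local.shape[1])
    attn = softmax(logits, axis=0)                    # slots compete per token
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    aggregated = weights @ tokens                     # standard weighted mean
    # ASSUMPTION: use the slot's best competition share as its confidence.
    confidence = attn.max(axis=1, keepdims=True)      # (num_slots, 1)
    # Weakly matching slots (low confidence) barely move.
    new_local = local + confidence * (aggregated - local)
    return new_local, confidence

def identity_filter(identity, local, alpha=0.1):
    # ASSUMPTION: a fixed moving average stands in for the learned
    # recurrent transition; it smooths the local state over time.
    return (1.0 - alpha) * identity + alpha * local

# Toy frame: four identical tokens, one matching slot and one idle slot.
tokens = np.tile(np.array([4.0, 0.0]), (4, 1))
local = np.array([[10.0, 0.0],
                  [0.0, 0.0]])
identity = local.copy()

new_local, confidence = cma_update(local, tokens)
new_identity = identity_filter(identity, new_local)
print("confidence per slot:", confidence.ravel())
print("updated local states:\n", new_local)
print("updated identity states:\n", new_identity)
```

With these toy inputs the matching slot moves to the token mean while the idle slot stays essentially where it was, and the identity states change more slowly than the local states. A real implementation would learn both the confidence signal and the transition.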

Code and models will be made publicly available upon acceptance.

