AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes
Quick Take
AVTrack is a new audio-visual instance segmentation (AVIS) dataset for complex human-centric scenes, built to address the simple scenes and coarse annotations of existing datasets. Representative AVIS methods degrade substantially on it, establishing AVTrack as a challenging benchmark for spatiotemporal modeling and cross-modal reasoning.
Key Points
- AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes.
- Existing datasets are limited to simple audio-visual scenes with coarse annotations.
- Current AVIS methods show substantial performance degradation on AVTrack.
- Audio-visual speaker tracking supports applications such as intelligent video editing, surveillance, and human-computer interaction.
- A baseline is provided to facilitate future research in audio-visual tracking.
Article Content
From source RSS / original summary: arXiv:2606.02724v1 (Announce Type: new)
Abstract: Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations.
Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes.
Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/