AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

arXiv cs.CV·Yaoting Wang, Yun Zhou, Zipei Zhang, Henghui Ding

6/3/2026

Quick Take

AVTrack introduces a human-centric audio-visual instance segmentation (AVIS) dataset for complex, dynamic scenes, addressing existing datasets' reliance on simple or homogeneous scenes with coarse annotations. Representative AVIS methods degrade substantially on it, establishing AVTrack as a challenging benchmark for spatiotemporal modeling and cross-modal reasoning.

Key Points

AVTrack features diverse conditions like camera motion and visual occlusions.
Existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations.
Current AVIS methods show substantial performance degradation on AVTrack.
Target applications of audio-visual speaker tracking include intelligent video editing, surveillance, and human-computer interaction.
A baseline is provided to facilitate future research in audio-visual tracking.

Article Content

From source RSS / original summary

arXiv:2606.02724v1

Abstract: Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations.

Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes.

Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/

