Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization
Quick Answer
The Hierarchical Semantic-Constrained Heterogeneous Graph (HSCHG) framework targets open-vocabulary audio-visual event localization. It addresses two gaps in prior work: audio-visual consistency across temporal scales for unseen categories, and semantic consistency between segment-level and video-level representations. The authors report that it outperforms existing methods on the OV-AVEL benchmark, with ablation studies supporting the design.
Key Points
- HSCHG constructs a heterogeneous hierarchical graph for audio-visual event localization.
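- Uses multi-directional temporal edges to capture temporal information within each modality.
- Applies a dual-threshold filtering gated fusion strategy that admits cross-modal information only when alignment confidence is high.
- Introduces bidirectional semantic constraints between segment- and video-level representations.
- Maps multi-level audio-visual representations and text prototypes into hyperbolic space, with a hierarchical entailment regularization loss.
- Reports better results than existing methods on the OV-AVEL benchmark.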
Article Content
From source RSS / original summary
arXiv:2606.07033v1 Announce Type: new
Abstract: Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales.
Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes.
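The graph layout described above can be pictured with a minimal Python sketch. It is an illustration only, not the authors' code: the `HeteroGraph` class, the node naming scheme, the edge type names, and the choice of one video-level node per modality are all assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    """Tiny typed graph (hypothetical container, not from the paper)."""
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source, target, edge type)


def build_hierarchical_graph(num_segments):
    """Create segment nodes per modality plus one video-level node each.

    Assumption: each modality gets its own video-level node, linked to all
    of its segment nodes in both directions.
    """
    graph = HeteroGraph()
    for modality in ("audio", "visual"):
        video_id = f"{modality}_video"
        graph.nodes[video_id] = f"{modality}_video"
        for t in range(num_segments):
            seg_id = f"{modality}_seg_{t}"
            graph.nodes[seg_id] = f"{modality}_segment"
            # Hierarchy edges connect the two levels in both directions.
            graph.edges.append((seg_id, video_id, "segment_to_video"))
            graph.edges.append((video_id, seg_id, "video_to_segment"))
    return graph


graph = build_hierarchical_graph(num_segments=3)
print(len(graph.nodes), "nodes,", len(graph.edges), "edges")
```

Keeping node and edge types explicit is what makes the graph heterogeneous: a downstream layer can then treat each edge type with its own parameters.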
We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels.
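As a rough illustration of the two mechanisms in this paragraph, the Python sketch below builds temporal edges within one modality and gates cross-modal fusion with two thresholds. The details are assumptions rather than the paper's definitions: "multi-directional" is read here as self, forward, and backward edges between neighbouring segments, alignment confidence is approximated by cosine similarity, and the gate is fully closed below `low`, fully open above `high`, and ramps linearly in between. All function names and threshold values are hypothetical.

```python
import math


def temporal_edges(num_segments, modality):
    """Self, forward, and backward edges between neighbouring segments."""
    edges = []
    for t in range(num_segments):
        node = f"{modality}_seg_{t}"
        edges.append((node, node, "self"))
        if t + 1 < num_segments:
            nxt = f"{modality}_seg_{t + 1}"
            edges.append((node, nxt, "forward"))
            edges.append((nxt, node, "backward"))
    return edges


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def dual_threshold_gate(similarity, low=0.3, high=0.7):
    """Assumed gate: 0 below `low`, 1 above `high`, linear ramp between."""
    if similarity >= high:
        return 1.0
    if similarity <= low:
        return 0.0
    return (similarity - low) / (high - low)


def gated_fusion(own, other, low=0.3, high=0.7):
    """Add the other modality's features, scaled by the alignment gate."""
    gate = dual_threshold_gate(cosine(own, other), low, high)
    return [o + gate * x for o, x in zip(own, other)]


audio = [1.0, 0.0]
visual_aligned = [0.9, 0.1]
visual_unrelated = [0.0, 1.0]
print(gated_fusion(audio, visual_aligned))    # cross-modal features admitted
print(gated_fusion(audio, visual_unrelated))  # gate closed, audio unchanged
```

The point of the second threshold in this reading is to separate "clearly misaligned" from "uncertain" pairs, so that poorly aligned segments contribute nothing rather than a small amount of noise.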
Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.
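To make the hyperbolic step more concrete, here is a small Python sketch of one common way to realise it. It is not taken from the paper: it assumes the Poincaré ball model with curvature -1, uses the exponential map at the origin to move Euclidean features into the ball, and uses the entailment-cone energy from the hyperbolic embedding literature as a stand-in for the hierarchical entailment regularization loss. Treating the video-level embedding as the parent of a segment embedding, and the aperture constant `k`, are also assumptions.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def exp_map_origin(v):
    """Map a Euclidean vector into the unit Poincare ball (curvature -1)."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(n) / n
    return [scale * x for x in v]


def half_aperture(x, k=0.1):
    """Half-aperture of the entailment cone at x (x must not be the origin)."""
    n = norm(x)
    return math.asin(min(1.0, k * (1.0 - n * n) / n))


def exterior_angle(x, y):
    """Angle at x between the outward radial direction and the geodesic to y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx2 = sum(a * a for a in x)
    ny2 = sum(b * b for b in y)
    diff = norm([b - a for a, b in zip(x, y)])
    den = math.sqrt(nx2) * diff * math.sqrt(max(0.0, 1.0 + nx2 * ny2 - 2.0 * dot))
    if den == 0.0:
        return 0.0  # x at the origin or x == y: angle is undefined, use 0
    num = dot * (1.0 + nx2) - nx2 * (1.0 + ny2)
    return math.acos(max(-1.0, min(1.0, num / den)))


def entailment_energy(parent, child, k=0.1):
    """Zero when `child` lies inside the cone of `parent`, positive otherwise."""
    return max(0.0, exterior_angle(parent, child) - half_aperture(parent, k))


video = exp_map_origin([0.3, 0.0])    # assumed parent: closer to the origin
segment = exp_map_origin([0.8, 0.0])  # assumed child: farther out, same direction
print(entailment_energy(video, segment))  # child inside the parent's cone
print(entailment_energy(segment, video))  # reversed order is penalised
```

In this formulation, more general concepts sit nearer the origin and their cones widen, so minimising the energy pushes each child embedding into the region "below" its parent.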