Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

arXiv cs.AI·Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

3h ago

·~2 min·6/8/2026·en·0

Quick Answer

The proposed Hierarchical Semantic-Constrained Heterogeneous Graph (HSCHG) framework targets open-vocabulary audio-visual event localization. It addresses two problems: keeping audio and visual cues consistent across temporal scales, and keeping segment-level and video-level semantics consistent with each other. The authors report that it outperforms existing methods on the OV-AVEL benchmark, with ablation studies supporting the design.

Quick Take

Key Points

HSCHG constructs a heterogeneous hierarchical graph of audio and visual segment nodes and their corresponding video-level nodes.
It uses multi-directional temporal edges to capture temporal information within each modality.
It applies a dual-threshold filtering gated fusion strategy that admits cross-modal information only when alignment confidence is high.
It introduces bidirectional semantic constraints between segment- and video-level representations.
The authors report that it outperforms existing methods on the OV-AVEL benchmark.

Article Content

From source RSS / original summary

arXiv:2606.07033v1 (Announce Type: new)

Abstract: Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales.

Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose HSCHG, a hierarchical semantic-constrained heterogeneous graph framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes.
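To make the graph layout concrete, here is a minimal sketch of a heterogeneous hierarchical graph with typed nodes and typed edges. The abstract does not give the exact schema, so everything named here is an assumption for illustration: the `HeteroGraph` container, the `build_hierarchical_graph` helper, the choice of one video-level node per modality, and the edge-type labels.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    """Hypothetical container: typed nodes and typed, directed edges."""
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source id, target id, edge type)


def build_hierarchical_graph(num_segments):
    """Create segment nodes and one video-level node per modality (assumed)."""
    graph = HeteroGraph()
    for modality in ("audio", "visual"):
        video_id = f"{modality}_video"
        graph.nodes[video_id] = f"{modality}_video"
        for t in range(num_segments):
            seg_id = f"{modality}_seg_{t}"
            graph.nodes[seg_id] = f"{modality}_segment"
            # Hierarchy edges in both directions link each segment to its video node.
            graph.edges.append((seg_id, video_id, "segment_to_video"))
            graph.edges.append((video_id, seg_id, "video_to_segment"))
    return graph


graph = build_hierarchical_graph(num_segments=3)
print(len(graph.nodes), len(graph.edges))
```

Storing the node type next to each node id is what makes the graph heterogeneous: later steps can treat audio segments, visual segments, and video-level nodes differently.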

We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels.
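The temporal edges and the gated fusion can be sketched as follows. This is one possible reading, not the authors' implementation: the abstract does not define the edge directions, the confidence score, or how the two thresholds interact. The sketch assumes forward, backward, and self edges; cosine similarity as the alignment confidence; and a gate that is closed below `tau_low`, fully open above `tau_high`, and ramps linearly in between. All function names and threshold values are hypothetical.

```python
import math


def temporal_edges(num_segments):
    """Self, forward, and backward edges within one modality (assumed edge set)."""
    edges = []
    for t in range(num_segments):
        edges.append((t, t, "self"))
        if t + 1 < num_segments:
            edges.append((t, t + 1, "forward"))
            edges.append((t + 1, t, "backward"))
    return edges


def cosine(u, v):
    """Cosine similarity, used here as a stand-in for alignment confidence."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def gate_weight(confidence, tau_low=0.3, tau_high=0.7):
    """Assumed dual-threshold gate: 0 below tau_low, 1 above tau_high, linear between."""
    if confidence <= tau_low:
        return 0.0
    if confidence >= tau_high:
        return 1.0
    return (confidence - tau_low) / (tau_high - tau_low)


def fuse(own, other, tau_low=0.3, tau_high=0.7):
    """Add the other modality's features, scaled by the gate weight."""
    weight = gate_weight(cosine(own, other), tau_low, tau_high)
    return [a + weight * b for a, b in zip(own, other)]


print(temporal_edges(3))
print(fuse([1.0, 0.0], [1.0, 0.0]))  # well aligned: cross-modal features are added
print(fuse([1.0, 0.0], [0.0, 1.0]))  # poorly aligned: features pass through unchanged
```

A stricter reading of "only when the alignment confidence is high" would replace the linear ramp with a hard cutoff at `tau_high`; the sketch keeps the ramp only to show how two thresholds might be used.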

Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.
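As a rough illustration of mapping Euclidean features into hyperbolic space, the sketch below uses the Poincaré ball model with the exponential map at the origin. The abstract does not say which hyperbolic model or curvature the authors use, so the Poincaré ball, the curvature parameter `c`, and both function names are assumptions. The hierarchical entailment regularization loss is not reproduced here because the abstract does not define it.

```python
import math


def exp_map_origin(v, c=1.0):
    """Map a Euclidean vector onto the Poincare ball of curvature -c (assumed model)."""
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the Poincare ball."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


segment = exp_map_origin([0.3, 0.4])
video = exp_map_origin([0.1, 0.1])
print(poincare_distance(segment, video))
```

Distances in the ball grow rapidly near the boundary, which is why hyperbolic embeddings are commonly used for tree-like hierarchies. For vectors with a large norm, the mapped point can round to the boundary in floating point, so a practical version would clip the norm before computing distances.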
