Multimodal Group Emotion Recognition In-the-Wild: Towards a Privacy-Safe Non-Individual Approach
Quick Answer
This thesis presents a privacy-preserving approach to group emotion recognition (GER) that uses collective audio-video signals instead of individual cues.
Quick Take
This thesis presents a privacy-preserving approach to group emotion recognition (GER) using collective audio-video signals instead of individual cues. It introduces two frameworks: a cross-attention multimodal architecture for audio-video fusion and a Variational Encoder Multi-Decoder (VE-MD) for emotion classification, achieving robust performance in real-world conditions without relying on individual features.
Key Points
- Proposes a cross-attention multimodal architecture for effective audio-video fusion.
- Introduces Variational Encoder Multi-Decoder (VE-MD) for emotion classification.
- Demonstrates robustness in group emotion recognition under real-world conditions.
- Reduces risks of individual monitoring by focusing on collective signals.
- Achieves competitive performance without using individual features.
Article Content
From source RSS / original summary (arXiv:2606.07585v1). Abstract: This thesis addresses group emotion recognition (GER) in-the-wild with a focus on privacy preservation. Unlike traditional emotion recognition methods that rely on individual-level cues such as face, gaze, or voice analysis, this work uses collective audio-video signals to infer emotions at the group level, reducing risks of individual monitoring and surveillance. Two complementary frameworks are proposed.
The first is a cross-attention multimodal architecture for audio-video fusion, combined with Frames Attention Pooling (FAP) for temporal aggregation. It is supported by synthetic data augmentation and validated through ablation studies, demonstrating robustness in real-world GER conditions. The second framework, Variational Encoder Multi-Decoder (VE-MD), learns a shared latent space for emotion classification and structural representation prediction, including body and face cues.
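The first framework can be pictured with a small sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the feature sizes, the omission of learned projections, the residual connection, and the use of a single scoring vector as a stand-in for Frames Attention Pooling. It shows only the general idea of video frames attending over audio frames, followed by an attention-weighted average over time.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys_values):
    """Each query frame attends over all frames of the other modality.

    Learned query/key/value projections are left out (identity) to keep
    the sketch short; that is an assumption, not the thesis design.
    """
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        context = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(len(q))
        ]
        # Residual connection (assumed): query features plus attended context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused

def attention_pool(frames, score_vector):
    """Collapse the time axis with softmax weights over per-frame scores."""
    weights = softmax([dot(f, score_vector) for f in frames])
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

random.seed(0)
DIM, T_VIDEO, T_AUDIO = 4, 5, 3  # hypothetical sizes
video = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(T_VIDEO)]
audio = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(T_AUDIO)]
score_vector = [random.gauss(0, 1) for _ in range(DIM)]  # stand-in for a learned parameter

fused = cross_attention(video, audio)  # video frames query the audio frames
clip_embedding, frame_weights = attention_pool(fused, score_vector)
```

In this sketch `clip_embedding` is a single vector per clip that a group-level classifier could consume, and `frame_weights` indicates which frames dominated the pooled representation.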
Two decoding strategies, DETR-based and heatmap-based, are explored to analyze the role of structural representations in group and individual settings. The thesis makes three main contributions: it clarifies the role of multimodality and structural cues in group-level affective computing; introduces two architectures for privacy-preserving multimodal GER; and shows that competitive performance can be achieved without using individual features as input data.
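The shared-latent idea behind VE-MD can be sketched in the same spirit. This is not the thesis implementation: it assumes a Gaussian latent with the reparameterization trick and a KL term as in a standard variational autoencoder, and it replaces the DETR-based and heatmap-based decoders with a plain linear head. All dimensions, weight names, and function names are hypothetical.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(x, weights):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes: input features, latent, emotion classes, structure targets.
IN_DIM, LATENT_DIM, NUM_CLASSES, STRUCT_DIM = 8, 3, 3, 6

W_mu = rand_matrix(LATENT_DIM, IN_DIM)
W_logvar = rand_matrix(LATENT_DIM, IN_DIM)
W_emotion = rand_matrix(NUM_CLASSES, LATENT_DIM)  # decoder 1: emotion classes
W_struct = rand_matrix(STRUCT_DIM, LATENT_DIM)    # decoder 2: structural cues (stand-in)

def encode(x):
    """Map input features to the mean and log-variance of the latent."""
    return linear(x, W_mu), linear(x, W_logvar)

def sample_latent(mu, logvar):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0, 1) for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    """KL divergence between N(mu, sigma^2) and N(0, I)."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def forward(x):
    mu, logvar = encode(x)
    z = sample_latent(mu, logvar)
    # Both decoders read the same latent vector z.
    emotion_probs = softmax(linear(z, W_emotion))
    structure = linear(z, W_struct)
    return emotion_probs, structure, kl_to_standard_normal(mu, logvar)

x = [random.gauss(0, 1) for _ in range(IN_DIM)]
emotion_probs, structure, kl = forward(x)
```

The point of the sketch is that both heads depend on one latent vector, so a structural prediction objective can shape the representation used for emotion classification. How the thesis weights the objectives is not stated in the abstract.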
