Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

arXiv cs.CV·Anderson Augusma

6/9/2026 · ~1 min

Quick Answer

This thesis presents a privacy-preserving approach to group emotion recognition (GER) that uses collective audio-video signals instead of individual cues.

Quick Take

This thesis presents a privacy-preserving approach to group emotion recognition (GER) using collective audio-video signals instead of individual cues. It introduces two frameworks: a cross-attention multimodal architecture for audio-video fusion and a Variational Encoder Multi-Decoder (VE-MD) for emotion classification, achieving robust performance in real-world conditions without relying on individual features.

Key Points

Proposes a cross-attention multimodal architecture for effective audio-video fusion.
Introduces Variational Encoder Multi-Decoder (VE-MD) for emotion classification.
Demonstrates robustness in group emotion recognition under real-world conditions.
Reduces risks of individual monitoring by focusing on collective signals.
Achieves competitive performance without using individual features.

Article Content

From source RSS / original summary

arXiv:2606.07585v1. Abstract: This thesis addresses group emotion recognition (GER) in-the-wild with a focus on privacy preservation. Unlike traditional emotion recognition methods that rely on individual-level cues such as face, gaze, or voice analysis, this work uses collective audio-video signals to infer emotions at the group level, reducing risks of individual monitoring and surveillance. Two complementary frameworks are proposed.

The first is a cross-attention multimodal architecture for audio-video fusion, combined with Frames Attention Pooling (FAP) for temporal aggregation. It is supported by synthetic data augmentation and validated through ablation studies, demonstrating robustness in real-world GER conditions. The second framework, Variational Encoder Multi-Decoder (VE-MD), learns a shared latent space for emotion classification and structural representation prediction, including body and face cues.
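As a rough illustration of the two ideas named above, the sketch below lets video frame vectors attend over audio vectors (cross-attention) and then collapses the fused frames into one clip vector with softmax weights (attention pooling over frames). It is a minimal sketch under stated assumptions, not the thesis architecture: there are no learned projections, the residual sum, the `scoring_vector`, and all toy inputs are hypothetical, and the real FAP module may differ.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys_values):
    """Each query vector attends over the vectors of the other modality.

    Assumption: keys and values are the same vectors and no projection
    matrices are learned; a trained model would use separate Q/K/V maps.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        context = [sum(w * kv[i] for w, kv in zip(weights, keys_values))
                   for i in range(dim)]
        # Residual sum keeps the original frame information (illustrative choice).
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused

def attention_pool(frames, scoring_vector):
    """Weighted average over frames; weights are a softmax of frame scores."""
    weights = softmax([dot(f, scoring_vector) for f in frames])
    dim = len(frames[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return pooled, weights

# Hypothetical toy features: 3 video frames and 2 audio segments, 2-d each.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]

fused = cross_attention(video, audio)
clip_vector, frame_weights = attention_pool(fused, [0.1, 0.2])
```

The pooled `clip_vector` has a fixed size regardless of how many frames the clip contains, which is why attention-style pooling is a common choice for temporal aggregation.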

Two decoding strategies, DETR-based and heatmap-based, are explored to analyze the role of structural representations in group and individual settings. The thesis makes three main contributions: it clarifies the role of multimodality and structural cues in group-level affective computing; introduces two architectures for privacy-preserving multimodal GER; and shows that competitive performance can be achieved without using individual features as input data.
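The shared-latent idea behind VE-MD can be pictured as one encoder feeding several decoder heads. The sketch below assumes a standard Gaussian variational encoder with the reparameterization trick and a KL term, which the abstract does not spell out; the weights, the three emotion classes, and the four structural outputs are all hypothetical placeholders, and the plain linear heads stand in for the DETR-based or heatmap-based decoders mentioned above.

```python
import math
import random

def linear(weights, bias, x):
    """Plain affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

# Hypothetical toy weights: 3-d input feature, 2-d shared latent space.
W_MU = [[0.2, -0.1, 0.0], [0.0, 0.3, 0.1]]
W_LOGVAR = [[0.0, 0.0, 0.1], [0.1, 0.0, 0.0]]
W_EMOTION = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]             # 3 illustrative classes
W_STRUCT = [[0.5, 0.5], [0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]   # 4 illustrative outputs

def forward(features, rng):
    mu = linear(W_MU, [0.0, 0.0], features)
    log_var = linear(W_LOGVAR, [0.0, 0.0], features)
    z = reparameterize(mu, log_var, rng)               # shared latent code
    emotion_logits = linear(W_EMOTION, [0.0] * 3, z)   # decoder head 1: emotion
    structure = linear(W_STRUCT, [0.0] * 4, z)         # decoder head 2: structure
    return emotion_logits, structure, kl_to_standard_normal(mu, log_var)

emotion_logits, structure, kl = forward([1.0, 2.0, 0.5], random.Random(0))
```

In a setup like this, the structural head would act only as a training signal, so the classifier input stays at the group level; whether the thesis trains its heads exactly this way is not stated in the abstract.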
