Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures
Quick Take
Semantic motion anchors improve text-to-gesture R@1 by 8.2% over a direct text-motion baseline on the BEAT2 benchmark. The method uses structured natural-language descriptions of gestures as auxiliary contrastive supervision, and it outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text directions.
Key Points
- Semantic motion anchors are natural-language abstractions of gesture motion, built by discretizing 3D gestures into body-hand motion primitives and verbalizing them.
- On BEAT2, the method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline.
- Outperforms previous approaches in both text-to-gesture and gesture-to-text retrieval.
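- In a downstream retrieval-augmented generation study, users significantly preferred gestures retrieved by this approach over a retrieval-augmented generation baseline.
- Retrieved gestures are semantically meaningful for the spoken query rather than defaulting to generic motion patterns.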
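For readers unfamiliar with the R@1 figure quoted above, here is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. The function name, the toy matrix, and the assumption that query i matches candidate i are illustrative only; the paper's exact evaluation protocol is not described in this summary.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching item (same index) ranks in the top k.

    similarity[i][j] is the score between text query i and gesture j.
    Assumption: the correct gesture for query i is gesture i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank gesture indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not from the paper).
toy = [
    [0.9, 0.1, 0.3],  # query 0: correct gesture ranked first
    [0.2, 0.4, 0.8],  # query 1: correct gesture ranked second
    [0.1, 0.2, 0.7],  # query 2: correct gesture ranked first
]
print(recall_at_k(toy, k=1))
print(recall_at_k(toy, k=2))
```

Under the same assumption, gesture-to-text recall can be obtained by transposing the matrix so that gestures act as queries.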
Article Content
From source RSS / original summary: arXiv:2605.30608v1, Announce Type: new
Abstract: Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures.
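The "direct contrastive alignment" mentioned above is typically implemented as a symmetric InfoNCE objective over paired embeddings. The sketch below shows that generic formulation in plain Python; it is an assumption about the kind of baseline meant here, not the authors' code, and the temperature value and toy embeddings are placeholders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def symmetric_contrastive_loss(text_emb, motion_emb, temperature=0.07):
    """Average of the text-to-motion and motion-to-text InfoNCE terms."""
    return 0.5 * (
        info_nce(text_emb, motion_emb, temperature)
        + info_nce(motion_emb, text_emb, temperature)
    )


# Toy 2-D embeddings (hypothetical): identical pairs vs. swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(aligned, aligned))  # near zero: pairs match
print(symmetric_contrastive_loss(aligned, swapped))  # large: pairs mismatched
```

Because this loss only sees the motion embedding, nothing in it distinguishes a gesture's meaning from its kinematics, which is the limitation the abstract points to.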
We propose semantic motion anchors, natural-language abstractions of gesture motion capturing physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision.
On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns.
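To make the discretize, verbalize, and ground steps more concrete, the following is a rough sketch of how motion primitives might be turned into a structured description and how an auxiliary anchor term might be added to the direct loss. Everything here is hypothetical: the `MotionPrimitive` fields, the description template, the `anchor_weight` value, and the additive form of the combined loss are illustrative assumptions, since the abstract does not specify the primitive vocabulary, the templates, or the loss weighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionPrimitive:
    # All fields are hypothetical; the abstract does not list the vocabulary.
    body_part: str   # e.g. "right hand"
    movement: str    # e.g. "sweeps outward"
    hand_shape: str  # e.g. "open palm"


def verbalize(primitives, grounded_phrase):
    """Turn primitives into a structured description tied to a transcript phrase."""
    form = "; ".join(
        f"{p.body_part} {p.movement} with {p.hand_shape}" for p in primitives
    )
    return f'Form: {form}. Accompanies: "{grounded_phrase}".'


def total_loss(direct_loss, anchor_loss, anchor_weight=0.5):
    """Direct text-motion loss plus a weighted auxiliary anchor term (assumed form)."""
    return direct_loss + anchor_weight * anchor_loss


anchor = verbalize(
    [MotionPrimitive("right hand", "sweeps outward", "open palm")],
    "a huge amount",
)
print(anchor)
print(total_loss(direct_loss=1.2, anchor_loss=0.8))
```

In a setup like this, the anchor text would be embedded with a text encoder and contrasted against the motion embedding, so the motion representation is pulled toward a description of what the gesture means as well as how it moves.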
A downstream retrieval-augmented gesture generation study showed that users significantly preferred gestures retrieved by our approach over a retrieval-augmented generation baseline, demonstrating that semantically grounded retrieval translates to gestures that better convey communicative intent in downstream generation.