Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

arXiv cs.CL·Varsha Suresh, Mohammad Mahdi Abootorabi, Mohamed Salman, M. Hamza Mughal, Christian Theobalt, Ashwin Ram, Jürgen Steimle, Vera Demberg

6/1/2026

Quick Take

On the BEAT2 benchmark, semantic motion anchors improve text-to-gesture R@1 by 8.2% over a direct text-motion baseline. The method verbalizes gestures into structured natural-language descriptions and uses them as auxiliary contrastive supervision, outperforming prior retrieval approaches in both the text-to-gesture and gesture-to-text directions.
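The headline number is a Recall@1 (R@1) score: the fraction of queries whose top-ranked candidate is the correct match. The sketch below shows how R@1 could be computed in both retrieval directions from a similarity matrix. The matrix values are made-up toy numbers, and the function names are illustrative, not taken from the paper.

```python
def recall_at_1(similarity):
    """Fraction of queries whose highest-scoring candidate is the correct one.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: candidate i is the ground-truth match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap queries and candidates to evaluate the opposite direction."""
    return [list(col) for col in zip(*matrix)]


# Toy text-vs-gesture similarity scores (rows: transcripts, columns: gestures).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

text_to_gesture = recall_at_1(sim)             # transcripts query gestures
gesture_to_text = recall_at_1(transpose(sim))  # gestures query transcripts
print(text_to_gesture, gesture_to_text)
```

In this toy matrix the second transcript ranks the wrong gesture first, so two of three queries are hits in the text-to-gesture direction.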

Key Points

Semantic motion anchors discretize 3D gestures into body-hand motion primitives.
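On BEAT2, the method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline.
It outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text directions.
In a downstream retrieval-augmented gesture generation study, users significantly preferred gestures retrieved by this approach over a retrieval-augmented generation baseline.
Anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns.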
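To make the first key point concrete, here is a toy sketch of discretizing a short wrist trajectory into coarse motion primitives and verbalizing them as a structured description. Everything specific in it is an assumption for illustration: the axis convention, the thresholds, the label vocabulary, and the function names are hypothetical and are not the paper's primitive set.

```python
def discretize_direction(start, end, threshold=0.05):
    """Map net wrist displacement to a coarse direction label.

    Hypothetical convention: coordinates in metres, x = right, y = up, z = forward.
    """
    deltas = [e - s for s, e in zip(start, end)]
    axis = max(range(3), key=lambda i: abs(deltas[i]))  # dominant axis
    if abs(deltas[axis]) < threshold:
        return "holds still"
    labels = [
        ("moves left", "moves right"),
        ("moves down", "moves up"),
        ("moves backward", "moves forward"),
    ]
    return labels[axis][1 if deltas[axis] > 0 else 0]


def discretize_height(y, waist=0.0, shoulder=0.4):
    """Bin the final wrist height using hypothetical waist/shoulder levels."""
    if y < waist:
        return "below the waist"
    if y < shoulder:
        return "at chest level"
    return "above the shoulder"


def verbalize(wrist_track, hand_shape):
    """Turn a wrist trajectory plus a hand-shape label into a structured description."""
    start, end = wrist_track[0], wrist_track[-1]
    body = f"right hand {discretize_direction(start, end)}, ending {discretize_height(end[1])}"
    return {"body": body, "hand": hand_shape}


track = [(0.10, 0.10, 0.20), (0.10, 0.30, 0.20), (0.12, 0.50, 0.20)]
description = verbalize(track, "open palm")
print(description)
```

The resulting dictionary is plain text, so it can be fed to any text encoder alongside the transcript.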

Article Content

From source RSS / original summary

arXiv:2605.30608v1. Abstract: Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures.
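For readers unfamiliar with direct contrastive alignment, the following is a minimal sketch of a symmetric InfoNCE-style loss between transcript and motion embeddings, a common way to set up such a baseline. The abstract does not specify the baseline's exact loss, so the formulation, the temperature value, and the toy embeddings here are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_emb, motion_emb, temperature=0.07):
    """Average of text-to-motion and motion-to-text InfoNCE terms.

    Assumption: text_emb[i] and motion_emb[i] form the i-th matched pair.
    """
    t = [normalize(v) for v in text_emb]
    m = [normalize(v) for v in motion_emb]
    logits = [[dot(ti, mj) / temperature for mj in m] for ti in t]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


text = [[1.0, 0.0], [0.0, 1.0]]
motion_aligned = [[0.9, 0.1], [0.1, 0.9]]
motion_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(text, motion_aligned)
loss_shuffled = symmetric_contrastive_loss(text, motion_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss only sees the continuous embeddings, which is the setting the abstract argues tends to favor low-level kinematics over symbolic content.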

We propose semantic motion anchors, natural-language abstractions of gesture motion capturing physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision.

On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns.
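One plausible reading of "auxiliary contrastive supervision" is a second contrastive term that pulls each motion embedding toward the embedding of its verbalized anchor description, added to the main text-motion term. The sketch below illustrates that reading only. The abstract does not give the actual objective, so the pairing of anchor and motion, the `aux_weight` parameter, and the function names are assumptions.

```python
import math


def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: queries[i] should match keys[i].

    Assumption: all embeddings are already L2-normalised.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        total += m + math.log(sum(math.exp(x - m) for x in logits)) - logits[i]
    return total / len(queries)


def anchored_loss(text, motion, anchor, aux_weight=0.5):
    """Main text-motion term plus a weighted anchor-motion auxiliary term.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    main = 0.5 * (info_nce(text, motion) + info_nce(motion, text))
    aux = 0.5 * (info_nce(anchor, motion) + info_nce(motion, anchor))
    return main + aux_weight * aux


# Toy unit-length embeddings for two transcript/gesture pairs.
text = [[1.0, 0.0], [0.0, 1.0]]
motion = [[1.0, 0.0], [0.0, 1.0]]
anchor_matched = [[1.0, 0.0], [0.0, 1.0]]     # descriptions agree with the motion
anchor_mismatched = [[0.0, 1.0], [1.0, 0.0]]  # descriptions swapped between gestures

loss_matched = anchored_loss(text, motion, anchor_matched)
loss_mismatched = anchored_loss(text, motion, anchor_mismatched)
print(loss_matched < loss_mismatched)
```

With `aux_weight=0` the objective reduces to the direct text-motion term, which makes the auxiliary term easy to ablate.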

A downstream retrieval-augmented gesture generation study showed that users significantly preferred gestures retrieved by our approach over a retrieval-augmented generation baseline, demonstrating that semantically grounded retrieval translates to gestures that better convey communicative intent in downstream generation.

