DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation
Quick Take
DuoGesture is a dual-stream approach to co-speech gesture generation that splits synthesis into coupled semantic and beat streams, coordinated by a Semantic Variational Information Bottleneck. It outperforms strong holistic baselines in both objective evaluations and subjective experiments, with ablations supporting the roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.
Key Points
- DuoGesture decomposes gesture synthesis into semantic and beat streams for improved expressivity.
- Motion-Grounded Semantic Conditioning replaces purely linguistic word embeddings with motion-language representations.
- An Inertial Beat Prior reduces jitter and improves rhythmic consistency without constraining semantic frames.
- Outperforms strong holistic baselines in both objective evaluations and subjective experiments.
- Component ablations confirm the importance of semantic grounding and biomechanical regularization.
Article Content
From source RSS / original summary
arXiv:2605.26236v1 Announce Type: new
Abstract: Co-speech gesture generation requires both semantic expressivity and biomechanically plausible rhythmic motion. Existing holistic gesture models mix lexically grounded semantic gestures with frequent prosody-aligned beat gestures. This limits semantic grounding, speech-motion alignment, and kinematic smoothness.
We propose DuoGesture, a neuro-inspired and biomechanically informed dual-stream approach that decomposes co-speech gesture synthesis into coupled semantic and beat streams. The two streams are coordinated by a Semantic Variational Information Bottleneck, a stochastic frame-level gate that learns when semantic gestures should override rhythmic beat motion.
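The abstract does not say how the gate is parameterised, so the following is only a minimal sketch of one common way to build a stochastic frame-level gate, not the authors' implementation. It assumes a relaxed Bernoulli sample per frame, a convex blend of the two streams, and a KL penalty towards a fixed Bernoulli prior as the bottleneck term. The names `sample_gate`, `bernoulli_kl`, `blend_streams`, and the `prior` and `temperature` values are all hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_gate(logit, temperature=0.5, rng=random):
    """Draw a relaxed Bernoulli sample in [0, 1] (an assumed gate form)."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
    noise = math.log(u) - math.log(1.0 - u)  # logistic noise
    return sigmoid((logit + noise) / temperature)


def bernoulli_kl(q, p):
    """KL(Bernoulli(q) || Bernoulli(p)), used here as a bottleneck-style penalty."""
    q = min(max(q, 1e-6), 1.0 - 1e-6)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))


def blend_streams(semantic, beat, gate_logits, prior=0.2, rng=random):
    """Blend two pose sequences frame by frame with a sampled gate.

    semantic, beat: lists of frames, each frame a list of floats.
    gate_logits: one logit per frame; a high value favours the semantic stream.
    Returns the blended frames and the mean KL penalty.
    """
    blended, kl = [], 0.0
    for sem_frame, beat_frame, logit in zip(semantic, beat, gate_logits):
        g = sample_gate(logit, rng=rng)
        blended.append([g * s + (1.0 - g) * b
                        for s, b in zip(sem_frame, beat_frame)])
        kl += bernoulli_kl(sigmoid(logit), prior)
    return blended, kl / len(gate_logits)
```

In this sketch a low prior (here 0.2, an arbitrary choice) makes semantic frames the exception, which matches the abstract's description of beat gestures as the frequent case; the real model may encode that preference differently.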
The semantic stream is controlled by Motion-Grounded Semantic Conditioning, which replaces purely linguistic word embeddings with motion-language representations to provide motion-aligned semantic priors for long-tailed lexical triggers of gestures. The beat stream is further regularised by an Inertial Beat Prior, an anthropometry-weighted arm-chain module that reduces jitter and improves rhythmic consistency without constraining semantic frames.
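The abstract gives no formula for the Inertial Beat Prior. One plausible reading, sketched here purely as an assumption, is a per-joint weighted penalty on frame-to-frame acceleration along the arm chain, masked so that frames the gate marks as semantic contribute nothing. The weights in `ARM_WEIGHTS` are illustrative placeholders, not anthropometric values from the paper, and `inertial_beat_penalty` is a hypothetical name.

```python
# Illustrative per-joint weights; NOT taken from the paper or an anthropometric table.
ARM_WEIGHTS = {"shoulder": 0.5, "elbow": 0.3, "wrist": 0.2}


def acceleration(track, t):
    """Second finite difference of a 1-D joint track at frame t."""
    return track[t + 1] - 2.0 * track[t] + track[t - 1]


def inertial_beat_penalty(tracks, semantic_gate):
    """Weighted squared-acceleration penalty, applied only on beat frames.

    tracks: dict mapping joint name to a list of per-frame scalars
            (for example one joint angle per frame).
    semantic_gate: per-frame gate values in [0, 1]; 1 marks a semantic frame.
    """
    n_frames = len(semantic_gate)
    total, weight_sum = 0.0, 0.0
    for t in range(1, n_frames - 1):
        beat_weight = 1.0 - semantic_gate[t]  # semantic frames stay unconstrained
        for joint, w in ARM_WEIGHTS.items():
            total += beat_weight * w * acceleration(tracks[joint], t) ** 2
        weight_sum += beat_weight
    return total / weight_sum if weight_sum > 0 else 0.0
```

A constant-velocity track scores zero under this penalty while a track that flips direction every frame scores high, which is the jitter-reduction behaviour the abstract describes. Setting the gate to 1 on a frame removes that frame from the penalty entirely.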
Objective evaluations and subjective experiments show that DuoGesture outperforms strong holistic baselines, while component ablations confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.
