CapTalk: Text-Guided Stylization and Speech-Driven 3D Head Animation
Quick Take
The proposed framework generates real-time 3D facial animation from audio, with separate text-based control over speaking style and emotion. It draws on a large-scale dataset with textual descriptions of both style and emotion, and targets a limitation of existing methods: a fixed style or identity applied to a whole audio segment does not adapt to the emotional content of the speech.
Key Points
- Model supports dynamic emotion control during inference for varied emotional content.
- Addresses limitations of existing models that use fixed identity or style features.
- Constructs a large-scale dataset with textual descriptions of both speaking style and emotion.
- Enables real-time generation of synchronized lip movements and facial expressions.
- Improves user control over speaking styles and emotional expression.
Article Content
From source RSS / original summary: arXiv:2605.29316v1. Announce Type: new. Abstract: Audio-driven 3D facial animation aims to generate synchronized lip movements and vivid facial expressions from arbitrary audio clips. While existing methods can produce synchronized lip motions, they often rely on predefined identity or style latent features, which limits users' ability to freely control speaking styles.
Moreover, applying a fixed style or identity to an entire audio segment typically results in facial animation styles that do not adapt to the emotional content of the audio. To address these challenges, we revisit the entanglement between style and emotion, construct a large-scale dataset with textual descriptions of both style and emotion, and propose a novel talking head generation framework that enables separate control over style and emotion.
Our model takes as input both textual descriptions of speaking style and character emotion, as well as the driving audio stream, enabling real-time generation of highly synchronized lip movements and facial expressions that match the provided descriptions. Furthermore, our model supports dynamic emotion control during inference, allowing it to handle scenarios where the target emotion changes throughout the speech.
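The abstract does not describe the model's interface, so the following is only a hypothetical sketch of what "dynamic emotion control" could look like from the caller's side: one style description stays fixed for the whole clip, while a schedule of emotion descriptions is resolved to one prompt per animation frame. The names `EmotionCue` and `emotion_per_frame`, the 30 fps default, and the example prompts are all assumptions made for illustration; none of them come from the paper.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionCue:
    """Hypothetical container: an emotion description and when it starts."""
    start_s: float  # time in seconds at which this emotion takes effect
    text: str       # free-form emotion description


def emotion_per_frame(cues, duration_s, fps=30):
    """Return the active emotion description for each animation frame.

    Illustrative only: the frame at time t uses the latest cue whose
    start time is <= t; frames before the first cue use the first cue.
    """
    if not cues:
        raise ValueError("at least one emotion cue is required")
    ordered = sorted(cues, key=lambda c: c.start_s)
    starts = [c.start_s for c in ordered]
    n_frames = int(round(duration_s * fps))
    prompts = []
    for i in range(n_frames):
        t = i / fps
        idx = max(bisect_right(starts, t) - 1, 0)
        prompts.append(ordered[idx].text)
    return prompts


# Assumed example inputs: the style stays fixed, the emotion changes mid-clip.
style_text = "speaks slowly with restrained mouth movement"
schedule = [EmotionCue(0.0, "calm"), EmotionCue(1.0, "angry")]
frames = emotion_per_frame(schedule, duration_s=2.0, fps=4)
```

In this sketch the per-frame prompts would be passed to the model alongside `style_text` and the audio stream. How the actual framework encodes and blends these signals is not stated in the abstract.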
