SalsaAgent: A multimodal embodied language model for interactive dance generation
Quick Take
SalsaAgent is a multimodal language model that generates expressive, full-body salsa dance motions in reaction to a human leader and the accompanying music, using a two-stage token-to-diffusion pipeline. It extends an LLM's vocabulary with motion and pairwise relation tokens and grounds them by fine-tuning on automatically derived text descriptions of skeleton dynamics. The authors report significant improvements over baselines in motion quality and in music and partner coordination.
Key Points
- Adds new tokens for full-body motion and pairwise motion relations to the LLM vocabulary.
- Fine-tunes the LLM on automatically derived text descriptions of skeleton dynamics to ground the new tokens.
- Reports significant improvements over baselines in motion quality and in music and partner coordination.
- Shows consistent two-person spatial behavior in subjective and objective evaluations.
- Targets socially aware robots and interactive virtual agents.
Article Excerpt
From source RSS / original summary: arXiv:2605.29219v1. Announce type: new. Abstract: Interaction between humanoids involves bidirectional and nonverbal reactivity, coordination and synchrony. Toward socially aware robots and interactive virtual agents, we present SalsaAgent, a language model that generates expressive, full-body salsa dance motions in reaction to a human leader and against a contextual music backdrop.
We formulate interaction as nonverbal motion token passing, extending the vocabulary of a large language model (LLM) to process discrete motion tokens, pairwise relation tokens, and audio. Our contributions include new tokens for full-body and motion relations, LLM fine-tuning using automatically derived text descriptions of skeleton dynamics for token grounding, and a two-stage token-to-diffusion pipeline.
Subjective and objective evaluations demonstrate the effectiveness of our approach in terms of motion quality, music and partner coordination, and consistent two-person spatial behavior, with significant improvements over baselines.
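The "motion token passing" idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The vocabulary sizes, the token names (`<leader>`, `<follower>`, `<motion_i>`, `<rel_i>`), and the per-step ordering are all assumptions made for illustration, since the abstract does not specify them. Audio tokens and the diffusion stage are left out.

```python
from dataclasses import dataclass, field

# Hypothetical sizes: the abstract does not state vocabulary or codebook sizes.
BASE_VOCAB_SIZE = 32_000
NUM_MOTION_CODES = 512
NUM_RELATION_CODES = 64


@dataclass
class ExtendedVocab:
    """Assigns ids to new tokens, appended after the base text vocabulary."""
    base_size: int
    token_to_id: dict = field(default_factory=dict)

    def add(self, token: str) -> int:
        # New tokens get consecutive ids starting at base_size.
        if token not in self.token_to_id:
            self.token_to_id[token] = self.base_size + len(self.token_to_id)
        return self.token_to_id[token]

    def __len__(self) -> int:
        return self.base_size + len(self.token_to_id)


def build_vocab() -> ExtendedVocab:
    """Add role markers, motion codes, and relation codes (names are assumed)."""
    vocab = ExtendedVocab(BASE_VOCAB_SIZE)
    for role in ("leader", "follower"):
        vocab.add(f"<{role}>")
    for i in range(NUM_MOTION_CODES):
        vocab.add(f"<motion_{i}>")
    for i in range(NUM_RELATION_CODES):
        vocab.add(f"<rel_{i}>")
    return vocab


def interleave_turns(vocab, leader_codes, relation_codes, follower_codes):
    """Flatten one exchange per step: leader motion, relation, follower motion.

    This ordering is an assumption chosen for illustration.
    """
    t = vocab.token_to_id
    ids = []
    for lead, rel, follow in zip(leader_codes, relation_codes, follower_codes):
        ids += [
            t["<leader>"], t[f"<motion_{lead}>"],
            t[f"<rel_{rel}>"],
            t["<follower>"], t[f"<motion_{follow}>"],
        ]
    return ids


if __name__ == "__main__":
    vocab = build_vocab()
    print(len(vocab))
    print(interleave_turns(vocab, [0, 5], [1, 2], [3, 4]))
```

In a setup like this, a language model trained on such sequences would be prompted with the leader and relation tokens and asked to continue with the follower's motion tokens, which a second stage would then decode into continuous motion.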
