SalsaAgent: A multimodal embodied language model for interactive dance generation

arXiv cs.CV·Payam Jome Yazdian, Zoe Stanley, Angelica Lim

5/29/2026


Quick Answer

SalsaAgent is a multimodal language model that generates expressive, full-body salsa dance motions in reaction to a human leader and to music, using a two-stage token-to-diffusion pipeline.

Quick Take

SalsaAgent extends a large language model's vocabulary with discrete motion tokens and pairwise relation tokens, and fine-tunes the model on automatically derived text descriptions of skeleton dynamics to ground those tokens. Subjective and objective evaluations report significant improvements over baselines in motion quality, music and partner coordination, and consistency of two-person spatial behavior.

Key Points

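Introduces new tokens for full-body motion and for pairwise motion relations between the two dancers.
Fine-tunes the LLM on automatically derived text descriptions of skeleton dynamics to ground the new tokens.
Generates motion with a two-stage token-to-diffusion pipeline.
Reports significant improvements over baselines in motion quality and in coordination with music and partner.
Shows consistent two-person spatial behavior in subjective and objective evaluations.
Aims toward socially aware robots and interactive virtual agents.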


Article Excerpt

From source RSS / original summary

arXiv:2605.29219v1. Abstract: Interaction between humanoids involves bidirectional and nonverbal reactivity, coordination and synchrony. Toward socially aware robots and interactive virtual agents, we present SalsaAgent, a language model that generates expressive, full-body salsa dance motions in reaction to a human leader and against a contextual music backdrop.

We formulate interaction as nonverbal motion token passing, extending the vocabulary of a large language model (LLM) to process discrete motion tokens, pairwise relation tokens, and audio. Our contributions include new tokens for full-body and motion relations, LLM fine-tuning using automatically derived text descriptions of skeleton dynamics for token grounding, and a two-stage token-to-diffusion pipeline.
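The idea of extending an LLM vocabulary with motion and relation tokens can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the token names (`<motion_i>`, `<rel_i>`), the marker tokens (`<leader>`, `<relation>`, `<follower>`), the codebook sizes, and the ordering of a turn are all assumptions made for illustration, and audio input is left out entirely.

```python
from typing import Dict, List


def extend_vocab(base_vocab: Dict[str, int], prefix: str, count: int) -> Dict[str, int]:
    """Return a copy of base_vocab with `count` new tokens such as <prefix_0> appended."""
    vocab = dict(base_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    for i in range(count):
        token = f"<{prefix}_{i}>"
        if token not in vocab:
            vocab[token] = next_id
            next_id += 1
    return vocab


def build_turn(leader_codes: List[int], relation_codes: List[int]) -> List[str]:
    """Serialize one hypothetical interaction turn as a flat token sequence."""
    tokens = ["<leader>"]
    tokens += [f"<motion_{c}>" for c in leader_codes]
    tokens += ["<relation>"]
    tokens += [f"<rel_{c}>" for c in relation_codes]
    # A language model would continue from here with follower motion tokens.
    tokens += ["<follower>"]
    return tokens


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map token strings to integer ids; raises KeyError on unknown tokens."""
    return [vocab[t] for t in tokens]


# Hypothetical base vocabulary and codebook sizes (8 motion codes, 4 relation codes).
base_vocab = {"<pad>": 0, "<leader>": 1, "<relation>": 2, "<follower>": 3}
vocab = extend_vocab(base_vocab, "motion", 8)
vocab = extend_vocab(vocab, "rel", 4)

turn = build_turn(leader_codes=[3, 5], relation_codes=[1])
ids = encode(turn, vocab)
print(turn)
print(ids)
```

In a real system the new ids would also need matching rows in the model's embedding matrix, and the predicted follower tokens would be passed to a second stage (a diffusion model, per the abstract) to produce the final motion.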

Subjective and objective evaluations demonstrate the effectiveness of our approach in terms of motion quality, music and partner coordination, and consistent two-person spatial behavior, with significant improvements over baselines.

