RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation

arXiv cs.CV·Jiahao Zhang, Joseph Liu, Young-Yoon Lee, Seonghyeon Moon, Victor Zordan, Guy Tevet, Karen Liu, Stephen Gould, Oren Jacob, Haomiao Jiang, Mubbasir Kapadia, Yizhak Ben-Shabat

5/27/2026

Quick Take

RoMo is a new large-scale dataset for human motion generation, built with a taxonomy-aware filtering pipeline that removes static and artifact-prone sequences. Models trained on RoMo achieve state-of-the-art fidelity and diversity and show a better understanding of complex, subtle text prompts. The authors also release the Motion Toolbox to standardize metrics, data conversion, and visualization for reproducible research.

Key Points

RoMo resolves trade-offs between small, high-fidelity and large, low-quality motion datasets.
The dataset features a three-level semantic taxonomy for detailed evaluation.
Models trained on RoMo show superior understanding of complex text prompts.
Motion Toolbox standardizes metrics, data conversion, and visualization for reproducible research.
Quality filtering pipeline removes static and artifact-prone sequences effectively.
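
The summary does not say how static sequences are detected. As a minimal sketch of one plausible criterion, the Python below flags a sequence as static when its mean joint speed falls under a threshold. The function names, the frame layout (a list of per-frame (x, y, z) joint positions), the default frame rate, and the threshold value are all illustrative assumptions, not RoMo's actual pipeline.

```python
import math


def mean_joint_speed(frames, fps=30.0):
    """Average joint speed (position units per second) over a sequence.

    frames: list of frames, each a list of (x, y, z) joint positions.
    This layout is an assumption made for illustration.
    """
    total = 0.0
    count = 0
    # Compare every frame with its successor, joint by joint.
    for prev, cur in zip(frames, frames[1:]):
        for p, c in zip(prev, cur):
            total += math.dist(p, c) * fps
            count += 1
    return total / count if count else 0.0


def filter_static(sequences, fps=30.0, min_speed=0.05):
    """Keep only sequences whose mean joint speed reaches min_speed.

    min_speed is a hypothetical threshold, not a value from the paper.
    """
    return {
        name: frames
        for name, frames in sequences.items()
        if mean_joint_speed(frames, fps) >= min_speed
    }


# Two toy sequences with two joints each: one frozen, one drifting along x.
standing = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)] for _ in range(10)]
walking = [[(0.01 * t, 0.0, 0.0), (0.01 * t, 1.0, 0.0)] for t in range(10)]

kept = filter_static({"standing": standing, "walking": walking})
print(sorted(kept))
```

A real filter would likely also need checks for capture artifacts such as jitter or implausible poses, which this sketch does not attempt.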

Article Content

From source RSS / original summary

arXiv:2605.26241v1 Announce Type: new Abstract: Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences.

We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation that reveals model strengths and weaknesses obscured by global metrics.
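
To illustrate how per-category evaluation can expose what a global metric hides, here is a minimal Python sketch that averages a per-sequence score at each level of a three-level taxonomy path. The category names, the scores, and the function names are hypothetical. The abstract does not list RoMo's categories or say how the Motion Toolbox aggregates metrics.

```python
from collections import defaultdict


def global_mean(records):
    """Mean score over all (path, score) records."""
    scores = [score for _, score in records]
    return sum(scores) / len(scores)


def per_category_means(records, level):
    """Mean score per taxonomy prefix.

    records: iterable of (path, score), where path is a 3-tuple of
             category names (top, middle, leaf).
    level:   1, 2, or 3; how many leading path components to group by.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for path, score in records:
        key = path[:level]
        sums[key] += score
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical per-sequence scores (higher is better).
records = [
    (("locomotion", "walk", "forward"), 0.9),
    (("locomotion", "walk", "backward"), 0.8),
    (("locomotion", "run", "sprint"), 0.9),
    (("interaction", "object", "pick up"), 0.4),
]

print("global:", round(global_mean(records), 3))
for key, value in sorted(per_category_means(records, 1).items()):
    print(key, round(value, 3))
```

In this toy data the global mean looks moderate, while the top-level breakdown shows that one category scores far below the other. That gap is the kind of per-category weakness the abstract says global metrics obscure.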

We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.

