RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
Quick Take
RoMo is a new large-scale dataset for human motion generation. Its taxonomy-aware filtering pipeline removes static and artifact-prone sequences, and every sequence is captioned and placed in a three-level semantic taxonomy. Models trained on RoMo achieve state-of-the-art fidelity and diversity and follow complex, subtle text prompts more closely. The accompanying Motion Toolbox standardizes metrics, data conversion, and visualization to support reproducible research.
Key Points
- RoMo resolves the trade-off between small, high-fidelity motion capture datasets and large, low-quality in-the-wild collections.
- The dataset features a three-level semantic taxonomy for detailed evaluation.
- Models trained on RoMo show superior understanding of complex text prompts.
- The Motion Toolbox standardizes metrics, data conversion, and visualization for reproducible research.
- A taxonomy-aware filtering pipeline removes static and artifact-prone sequences.
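To make the filtering point concrete, here is a minimal sketch of one way a static sequence could be flagged: a threshold on mean joint speed. This is a stand-in, not the paper's pipeline; the summary does not describe how RoMo's taxonomy-aware filter works. The function names, the frame layout, and the `threshold` value are all assumptions chosen for illustration.

```python
import math


def mean_joint_speed(frames, fps):
    """Mean per-joint speed (position units per second) over a sequence.

    frames: list of frames; each frame is a list of (x, y, z) joint positions.
    This layout is an assumption, not RoMo's actual data format.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        for p, c in zip(prev, curr):
            # Displacement between consecutive frames, scaled to per-second.
            total += math.dist(p, c) * fps
            count += 1
    return total / count if count else 0.0


def is_static(frames, fps=30.0, threshold=0.05):
    """Flag a sequence whose joints barely move (hypothetical threshold)."""
    return mean_joint_speed(frames, fps) < threshold


# Two toy single-joint sequences, for illustration only.
still = [[(0.0, 0.0, 0.0)] for _ in range(10)]
moving = [[(0.1 * t, 0.0, 0.0)] for t in range(10)]
kept = [seq for seq in (still, moving) if not is_static(seq)]
```

A taxonomy-aware variant might use a different threshold per motion category, since a low-movement category would otherwise be removed wholesale, but that is speculation about the design rather than something the summary states.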
Article Content
From the original abstract (arXiv:2605.26241v1): Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences.
We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves this trade-off. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation that reveals model strengths and weaknesses obscured by global metrics.
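The idea of per-category evaluation can be sketched as follows, assuming each sample carries a three-level taxonomy path and some per-sample quality score. The category names, the scores, and the function `per_category_scores` are hypothetical; they are not taken from RoMo or the Motion Toolbox.

```python
from collections import defaultdict
from statistics import mean


def per_category_scores(samples, depth):
    """Average a per-sample score over taxonomy prefixes of a given depth.

    samples: iterable of (taxonomy_path, score), where taxonomy_path is a
             3-tuple of labels from coarsest to finest level.
    depth:   1, 2, or 3; how many taxonomy levels to group by.
    """
    buckets = defaultdict(list)
    for path, score in samples:
        buckets[path[:depth]].append(score)
    return {key: mean(vals) for key, vals in buckets.items()}


# Hypothetical labels and scores, for illustration only.
samples = [
    (("locomotion", "walking", "walk forward"), 0.9),
    (("locomotion", "walking", "walk backwards"), 0.5),
    (("locomotion", "running", "sprint"), 0.8),
    (("interaction", "handheld", "drink from cup"), 0.2),
]

global_score = mean(score for _, score in samples)
by_top_level = per_category_scores(samples, depth=1)
by_second_level = per_category_scores(samples, depth=2)
```

In this toy data the single global average hides the fact that the one "interaction" sample scores far below the "locomotion" samples, which is the kind of weakness a per-category breakdown is meant to surface.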
We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.