Beyond MoCap: Scaling Motion Tokenizers with Synthetic Human Motion for Generative Modeling

arXiv cs.CV·Yiwen Yan, Wanning He, Yu-Wing Tai

Quick Take

This study introduces a framework that pairs large-scale synthetic human motion with a redesigned VQ-VAE tokenizer, substantially improving the coverage and compositionality of the learned motion vocabulary. The authors report consistent gains on text-to-motion and motion continuation, and suggest that the limited support of the learned motion representation, rather than model architecture alone, is the primary bottleneck for generalization in human motion synthesis.

Key Points

Proposes a data generation pipeline for diverse synthetic human motion sequences.
Integrates a redesigned VQ-VAE tokenizer that adapts to the expanded motion space.
Demonstrates improved coverage and compositionality in learned motion vocabularies.
Achieves consistent gains in text-to-motion and motion continuation tasks.
Highlights the importance of scaling synthetic motion for better representation learning.

[Submitted on 25 Jun 2026]

Abstract: Human motion generation models are fundamentally constrained by the limited diversity of motion capture datasets, which predominantly contain common, repetitive actions and fail to cover the long tail of complex human movements, resulting in a restricted motion vocabulary in learned latent representations and poor generalization to rare, compositional, and highly dynamic motions. In this work, we propose a framework for expanding the motion representation space by leveraging large-scale synthetic human motion, introducing a data generation pipeline that produces diverse, physically plausible motion sequences beyond the distribution of existing datasets and integrating it with a redesigned VQ-VAE tokenizer that adapts to this expanded motion space. Unlike conventional tokenizers trained on narrow data distributions, our approach jointly scales both the training distribution and the discrete codebook, enabling the model to capture a significantly richer set of motion primitives. We demonstrate that training with synthetic motion substantially improves the coverage and compositionality of the learned motion vocabulary, leading to consistent gains across motion generation tasks such as text-to-motion and motion continuation, while remaining fully compatible with existing frameworks including MotionGPT. Our results suggest that the primary bottleneck lies in the limited support of the learned motion representation, rather than model architecture alone. Scaling synthetic motion in tandem with representation learning offers a principled path toward more expressive, controllable, and generalizable human motion synthesis.
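
To make the abstract's notion of a "motion vocabulary" more concrete, here is a minimal sketch of the generic vector-quantization step found in any VQ-VAE, plus one simple way to measure how much of a codebook a dataset actually uses. It is not the paper's redesigned tokenizer: the function names `quantize` and `codebook_usage`, the 2-D toy codebook, and the "narrow" and "broad" datasets are all hypothetical and chosen only for illustration.

```python
import random


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(z, codebook[k]))
        for z in latents
    ]


def codebook_usage(tokens, codebook_size):
    """Fraction of codebook entries that appear at least once in `tokens`."""
    return len(set(tokens)) / codebook_size


# Toy 2-D codebook with 4 entries (hypothetical values, not learned).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# "Narrow" data clusters near one corner; "broad" data spans the unit square.
rng = random.Random(0)
narrow = [[rng.uniform(0.0, 0.2), rng.uniform(0.0, 0.2)] for _ in range(200)]
broad = [[rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)] for _ in range(200)]

narrow_tokens = quantize(narrow, codebook)
broad_tokens = quantize(broad, codebook)

print("usage on narrow data:", codebook_usage(narrow_tokens, len(codebook)))
print("usage on broad data:", codebook_usage(broad_tokens, len(codebook)))
```

In this toy setup the narrow dataset only ever selects the code nearest its corner, while the broad dataset touches every code. That loosely mirrors the abstract's argument that the support of the training distribution limits the vocabulary a tokenizer can express. A real motion tokenizer would learn the codebook jointly with an encoder and decoder over pose sequences, which this sketch omits.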

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2606.27547 [cs.CV]
	(or arXiv:2606.27547v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2606.27547 arXiv-issued DOI via DataCite

Submission history

From: Yiwen Yan
[v1] Thu, 25 Jun 2026 20:50:31 UTC (41,822 KB)

Originally published at arxiv.org
