AnyAct: Towards Human Reenactment of Character Motion From Video
Quick Take
AnyAct enables human reenactment from non-human character videos using sparse local motion cues.
Key Points
- Focuses on motion reinterpretation rather than character reconstruction.
- Introduces three designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training, and global-local motion decoupling.
- Demonstrates high-fidelity reenactments preserving character dynamics.
Abstract: We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.
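The abstract names two ideas that a small example can make concrete: building 2D conditioning signals by projecting 3D human motion through varied cameras, and separating global from local motion. The sketch below is only an illustration under stated assumptions, not the paper's implementation. The function names (`project_joints`, `decouple_global_local`, `augmented_views`), the pinhole camera model, the yaw and focal-length ranges, and the root-relative definition of "local" motion are all hypothetical choices made for this example.

```python
import math
import random


def project_joints(joints_3d, yaw, focal=1.0, cam_dist=4.0):
    """Rotate 3D joints about the vertical (y) axis by `yaw`, then apply a pinhole projection.

    Assumes the skeleton is roughly unit scale, so every joint stays in front
    of a camera placed `cam_dist` units away (depth remains positive).
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    joints_2d = []
    for x, y, z in joints_3d:
        x_rot = cos_y * x + sin_y * z
        depth = -sin_y * x + cos_y * z + cam_dist
        joints_2d.append((focal * x_rot / depth, focal * y / depth))
    return joints_2d


def decouple_global_local(joints_2d, root=0):
    """Split a 2D pose into the root position (global) and root-relative offsets (local)."""
    root_x, root_y = joints_2d[root]
    local = [(x - root_x, y - root_y) for x, y in joints_2d]
    return (root_x, root_y), local


def augmented_views(joints_3d, n_views=4, seed=0):
    """Project one 3D pose through several randomly sampled cameras (hypothetical ranges)."""
    rng = random.Random(seed)
    views = []
    for _ in range(n_views):
        yaw = rng.uniform(-math.pi, math.pi)
        focal = rng.uniform(0.8, 1.2)
        views.append(project_joints(joints_3d, yaw, focal))
    return views


if __name__ == "__main__":
    # Toy three-joint pose: root, one joint above it, one joint to the side.
    pose = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.3, 0.5, 0.1)]
    for view in augmented_views(pose, n_views=2):
        root_xy, local = decouple_global_local(view)
        print("global root:", root_xy, "local offsets:", local)
```

Because each 2D view is derived from a known 3D human pose, pairs of (2D cue, 3D pose) can be produced from human motion data alone, which is one plausible reading of "human-motion-only supervision" in the abstract.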
| Comments: | 12 pages |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR) |
| Cite as: | arXiv:2605.15497 [cs.CV] (or arXiv:2605.15497v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.15497 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Lei Zhong
[v1] Fri, 15 May 2026 00:23:36 UTC (34,002 KB)
Originally published at arxiv.org