EgoExo-WM: Unlocking Exo Video for Ego World Models

arXiv cs.CV·Danny Tran, Roberto Martín-Martín, Kristen Grauman

Quick Take

EgoExo-WM converts abundant exocentric video into egocentric training data, improving the prediction quality and downstream planning performance of egocentric world models.

Key Points

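Extracts structured body pose from exocentric video as a representation of action.
Converts exocentric video to egocentric video, informed by a human kinematics prior.
Improves egocentric world-model training, with applications in robot planning and augmented-reality guidance.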

[Submitted on 14 May 2026]

Abstract: Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.
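
As a rough illustration of the planning setup the abstract describes (inferring a sequence of body poses that reaches a visual goal state), the sketch below runs a simple random-shooting search over an action-conditioned model. Everything in it is an assumption made for illustration: `toy_world_model`, the 2-D states and actions, and the random-shooting search itself are stand-ins, not the authors' model or planner, which the abstract does not specify.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: a real system would use image or latent states
# and whole-body pose vectors; here both are 2-D points for readability.
State = Tuple[float, float]
Action = Tuple[float, float]
Model = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Assumed toy dynamics (not a learned model): next state = state + action."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(model: Model, state: State, actions: Sequence[Action]) -> State:
    """Apply the model step by step and return the final predicted state."""
    for action in actions:
        state = model(state, action)
    return state


def plan_random_shooting(
    model: Model,
    start: State,
    goal: State,
    horizon: int = 5,
    num_candidates: int = 500,
    max_step: float = 1.0,
    seed: int = 0,
) -> Tuple[List[Action], float]:
    """Sample random action sequences and keep the one whose predicted
    final state lands closest (Euclidean distance) to the goal."""
    rng = random.Random(seed)
    best_actions: List[Action] = []
    best_cost = math.inf
    for _ in range(num_candidates):
        candidate = [
            (rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
            for _ in range(horizon)
        ]
        cost = math.dist(rollout(model, start, candidate), goal)
        if cost < best_cost:
            best_actions, best_cost = candidate, cost
    return best_actions, best_cost


if __name__ == "__main__":
    actions, cost = plan_random_shooting(toy_world_model, (0.0, 0.0), (2.0, -1.0))
    print(f"planned {len(actions)} steps, final distance to goal: {cost:.3f}")
```

The planner only ever calls the model through `rollout`, so a learned action-conditioned predictor could in principle be passed in place of `toy_world_model`, provided a suitable distance between predicted and goal states is available.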

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2605.15477 [cs.CV]
	(or arXiv:2605.15477v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.15477 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Danny Tran
[v1] Thu, 14 May 2026 23:35:54 UTC (1,585 KB)

— Originally published at arxiv.org
