PanoWorld: Geometry-Consistent Panoramic Video World Modeling

arXiv cs.CV·Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee, Tooba Imtiaz, Edmund Yeh, Jennifer Dy, Yanzhi Wang, Sarah Ostadabbas

4d ago

·~2 min·5/18/2026·en·1

Quick Take

PanoWorld generates geometry-consistent panoramic videos from single images, enhancing depth and motion accuracy.

Key Points

Addresses depth and motion inconsistencies in panoramic videos.
Introduces depth and trajectory consistency regularizers.
PanoGeo dataset supports training and evaluation.

📖 Reader Mode

~2 min read

[Submitted on 14 May 2026]

View PDF HTML (experimental)

Abstract:We present PanoWorld, a panoramic video world model that generates geometry-consistent 360$\degree$ video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.15391 [cs.CV]
	(or arXiv:2605.15391v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.15391 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sarah Ostadabbas [view email]
[v1] Thu, 14 May 2026 20:24:23 UTC (7,755 KB)

— Originally published at arxiv.org

Continue reading on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

Quick Take

Key Points

📖 Reader Mode

Submission history

Want this in your inbox every morning?

More from arXiv cs.CV

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

Structuring Open-Ended NAS: Semi-Automated Design Knowledge Structuring with LLMs for Efficient Neural Architecture Search

MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

Related in this space

AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

Tool-Augmented Agent for Closed-loop Optimization,Simulation,and Modeling Orchestration

Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines