GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
Quick Take
GEM-4D is a geometry-grounded video world model for robot manipulation. It injects dense 4D correspondence supervision into a video generative backbone during training, and the authors report that real-world manipulation success rises from 61% to 81%. They also report state-of-the-art video prediction and geometric consistency in both simulated and realistic scenarios.
Key Points
- GEM-4D is trained with dense 4D correspondence supervision distilled from a pretrained geometry foundation model.
- The model maintains a single-stream architecture with no additional inference cost.
- An inverse dynamics module converts video rollouts into executable robot trajectories.
- The authors report state-of-the-art performance in video prediction and geometric consistency.
- Real-world manipulation success improved from 61% to 81%.
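To put the 61% to 81% figure in perspective, the short calculation below separates the absolute gain from the relative one. It uses only the two success rates quoted above; the variable names are illustrative and nothing here comes from the paper's code.

```python
from fractions import Fraction

# Success rates quoted in the summary (baseline vs. GEM-4D).
baseline = Fraction(61, 100)
improved = Fraction(81, 100)

# Absolute gain, expressed in percentage points.
absolute_gain_points = (improved - baseline) * 100

# Relative gain in success rate, measured against the baseline.
relative_gain = (improved - baseline) / baseline

# Relative reduction in failure rate (39% of trials failing down to 19%).
failure_reduction = ((1 - baseline) - (1 - improved)) / (1 - baseline)

print(f"absolute gain: {float(absolute_gain_points):.0f} percentage points")
print(f"relative gain in success rate: {float(relative_gain):.1%}")
print(f"relative reduction in failure rate: {float(failure_reduction):.1%}")
```

The gain is 20 percentage points, which is roughly a 32.8% relative increase in success rate and roughly a 51.3% relative reduction in failures.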
Article Content
From source RSS / original summary
arXiv:2605.22882v1. Announce Type: new.
Abstract: Video world models can generate realistic futures from a single instruction, but they often fail to preserve consistent point-level motion over time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation.
We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost.
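The abstract does not spell out the training objective, so the following is only a minimal sketch of how training-time correspondence supervision is commonly combined with a generative loss. Everything in it is an assumption made for illustration: the function names, the pixel MSE stand-in for the generative objective, the L1 track loss, the `(T, N, 3)` track layout, and the `weight` value. The point it illustrates is that the teacher's tracks appear only in the loss, which is why supervision of this kind adds no inference cost.

```python
import numpy as np


def generation_loss(pred_frames, target_frames):
    """Stand-in for the backbone's generative objective: plain pixel MSE."""
    return float(np.mean((pred_frames - target_frames) ** 2))


def correspondence_loss(pred_tracks, teacher_tracks, valid):
    """L1 distance between student and teacher point tracks.

    pred_tracks, teacher_tracks: (T, N, 3) point positions over T frames.
    valid: (T, N) boolean mask of points the teacher marks as reliable.
    """
    per_point = np.abs(pred_tracks - teacher_tracks).sum(axis=-1)  # (T, N)
    n_valid = max(int(valid.sum()), 1)  # guard against an empty mask
    return float((per_point * valid).sum() / n_valid)


def training_loss(pred_frames, target_frames, pred_tracks, teacher_tracks,
                  valid, weight=0.5):
    """Hypothetical combined objective: appearance term plus geometry term.

    The teacher tracks are consumed only here, during training; at inference
    the model would be run without the teacher or this loss.
    """
    return (generation_loss(pred_frames, target_frames)
            + weight * correspondence_loss(pred_tracks, teacher_tracks, valid))
```

In a real system the two terms would be computed on tensors inside an autodiff framework; NumPy is used here only to keep the arithmetic of the combined loss easy to follow.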
We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at the project page: https://anonymous-submission-20.github.io/gem.github.io/.
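As a rough illustration of what an inverse dynamics module does, the sketch below fits a linear map from pairs of consecutive frame features to the action that connects them, then applies it along a rollout. This is a toy under stated assumptions, not the module from the paper: the linear model, the least-squares fit, the synthetic dynamics `s_next = s + B a`, and every name and dimension are hypothetical, and the random vectors stand in for features of generated video frames.

```python
import numpy as np


def fit_inverse_dynamics(feats_t, feats_next, actions):
    """Least-squares fit of a linear map from (frame_t, frame_t+1) to action."""
    X = np.concatenate([feats_t, feats_next], axis=1)
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return W


def rollout_to_actions(W, rollout_feats):
    """Turn a sequence of T+1 frame features into T actions."""
    X = np.concatenate([rollout_feats[:-1], rollout_feats[1:]], axis=1)
    return X @ W


# Toy world (assumption): each action shifts a 6-D feature vector by B @ a.
rng = np.random.default_rng(0)
state_dim, action_dim = 6, 3
B = rng.normal(size=(state_dim, action_dim))

# Training pairs sampled from the toy dynamics.
train_s = rng.normal(size=(200, state_dim))
train_a = rng.normal(size=(200, action_dim))
W = fit_inverse_dynamics(train_s, train_s + train_a @ B.T, train_a)

# A 10-step rollout standing in for features of a generated video.
true_actions = rng.normal(size=(10, action_dim))
start = rng.normal(size=(1, state_dim))
rollout = np.concatenate(
    [start, start + np.cumsum(true_actions @ B.T, axis=0)], axis=0
)

recovered_actions = rollout_to_actions(W, rollout)
print("max action error:", float(np.abs(recovered_actions - true_actions).max()))
```

The toy dynamics are exactly linear, so the fitted map recovers the actions almost perfectly. With real video the relationship is nonlinear and a learned network would normally take the place of the least-squares fit, but the interface is the same: consecutive frames in, actions out.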
