Neural Voxel Dynamics: Learning Implicit 3D Physics via Volumetric Feature Advection

arXiv cs.CV·Zican Wang, Niloy Mitra

·~2 min·6/26/2026·en·9

Quick Answer

The proposed self-supervised framework learns implicit 3D physics from video-derived signals by predicting in a 3D Volumetric Latent Space rather than in 2D image space. It reports good long-term structural stability and physical plausibility on the CLEVRER, PhysInOne, and PhysGaia benchmarks, without relying on classical physics simulators for training or inference.

Quick Take

Key Points

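Introduces Volumetric Feature Advection, which learns an action-conditioned transition operator over a voxel grid of video features.
Reports good long-term structural stability and physical plausibility on the CLEVRER, PhysInOne, and PhysGaia benchmarks.
Trains without physics engine internal states, labels, or surrogate models; supervision comes from video-derived signals plus action conditions.
Tracks material states implicitly within high-dimensional V-JEPA features.
Simulates heterogeneous phenomena, such as rigid body motion in fluid flow, within a single pipeline.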
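
To make the advection idea in the points above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes a hand-specified velocity field and nearest-neighbour semi-Lagrangian backtracing, whereas the paper's transition operator is learned and action-conditioned. The names `advect_features`, `grid`, and `velocity` are hypothetical.

```python
import numpy as np

def advect_features(grid, velocity, dt=1.0):
    """One semi-Lagrangian advection step with nearest-neighbour sampling.

    grid:     (X, Y, Z, C) voxel features (stand-in for lifted video features)
    velocity: (X, Y, Z, 3) displacement per unit time, in voxel units
              (assumed given here; a learned operator would predict it)
    """
    X, Y, Z, _ = grid.shape
    size = np.array([X, Y, Z])
    # Integer coordinates of every destination voxel, shape (X, Y, Z, 3).
    coords = np.stack(
        np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"),
        axis=-1,
    )
    # Trace each destination voxel backwards to the voxel it came from.
    src = np.rint(coords - dt * velocity).astype(int)
    inside = np.all((src >= 0) & (src < size), axis=-1)
    src = np.clip(src, 0, size - 1)
    # Advanced indexing returns a copy, so the input grid is not modified.
    out = grid[src[..., 0], src[..., 1], src[..., 2]]
    # Voxels whose source lies outside the grid receive zero features.
    out[~inside] = 0.0
    return out

# Usage: a single "blob" of feature mass moves one voxel along +x per step.
grid = np.zeros((4, 1, 1, 1))
grid[0, 0, 0, 0] = 1.0
velocity = np.zeros((4, 1, 1, 3))
velocity[..., 0] = 1.0
step1 = advect_features(grid, velocity)
step2 = advect_features(step1, velocity)
```

Backtracing from destination voxels, rather than pushing source voxels forward, guarantees that every output voxel receives exactly one value, which is one common reason semi-Lagrangian schemes are used on fixed grids.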


[Submitted on 24 Jun 2026]

Abstract: We present a self-supervised framework for learning implicit 3D physical dynamics directly from video-derived supervisory signals. While current generative video models achieve high visual fidelity, they lack a 3D geometric foundation, which often results in physical inconsistencies and a failure to maintain object permanence. We address this by shifting the predictive bottleneck from 2D image space to a "lifted" 3D Volumetric Latent Space. Our method unprojects semantic features from a Video Joint-Embedding Predictive Architecture (V-JEPA) into a voxelized grid, grounded by monocular depth priors. On this lifted representation, Volumetric Feature Advection learns an action-conditioned transition operator that treats physics as a spatio-temporal state advection problem; that is, it learns implicit 3D physics. Unlike state-of-the-art hybrid models that rely on explicit classical simulators for training and/or inference, our architecture tracks material states implicitly within high-dimensional V-JEPA features. This allows for the emergent simulation of heterogeneous phenomena (e.g., rigid body motion in fluid flow) within a single, unified pipeline. Supervised solely via end-to-end video-derived signals plus action conditions, without access to physics engine internal states, labels, or surrogate models, our model demonstrates good long-term structural stability and physical plausibility on multiple benchmarks (CLEVRER, PhysInOne, PhysGaia). We believe that this work opens a scalable pathway toward general-purpose dynamic world models that internalize the 3D invariants of the physical world solely through passive observation of monocular videos.
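
The lifting step described in the abstract can be illustrated with a rough NumPy sketch. This is an assumption-laden illustration rather than the paper's method: it assumes a pinhole camera with known intrinsics, a per-pixel feature map already aligned with the depth map, and simple mean pooling of features that land in the same voxel. The function name `unproject_to_voxels` and all parameters are hypothetical.

```python
import numpy as np

def unproject_to_voxels(features, depth, intrinsics, grid_min, voxel_size, grid_shape):
    """Scatter per-pixel features into a voxel grid using depth (pinhole model).

    features:   (H, W, C) per-pixel features (stand-in for V-JEPA features)
    depth:      (H, W) depth along the camera z axis (e.g. a monocular estimate)
    intrinsics: (fx, fy, cx, cy), assumed known
    grid_min:   (3,) camera-space coordinate of the grid's minimum corner
    voxel_size: edge length of one voxel
    grid_shape: (X, Y, Z) number of voxels per axis
    Returns the (X, Y, Z, C) mean-pooled feature grid and (X, Y, Z) hit counts.
    """
    H, W, C = features.shape
    fx, fy, cx, cy = intrinsics
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Back-project each pixel (u, v) with depth z to a 3D camera-space point.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # Convert points to integer voxel indices and drop those outside the grid.
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)
    shape = np.asarray(grid_shape)
    valid = np.all((idx >= 0) & (idx < shape), axis=1) & (depth.reshape(-1) > 0)
    idx = idx[valid]
    feats = features.reshape(-1, C)[valid]
    flat = np.ravel_multi_index(tuple(idx.T), tuple(grid_shape))
    # Accumulate features per voxel, then divide by the hit count (mean pooling).
    n_voxels = int(shape.prod())
    grid = np.zeros((n_voxels, C), dtype=np.float64)
    counts = np.zeros(n_voxels, dtype=np.int64)
    np.add.at(grid, flat, feats)
    np.add.at(counts, flat, 1)
    occupied = counts > 0
    grid[occupied] /= counts[occupied, None]
    return grid.reshape(*grid_shape, C), counts.reshape(grid_shape)

# Usage: a 2x2 feature map at constant depth 1 fills four distinct voxels.
features = np.arange(4, dtype=float).reshape(2, 2, 1)
depth = np.ones((2, 2))
voxels, counts = unproject_to_voxels(
    features, depth, intrinsics=(1.0, 1.0, 0.5, 0.5),
    grid_min=(-1.0, -1.0, 0.0), voxel_size=1.0, grid_shape=(2, 2, 2),
)
```

Mean pooling is only one possible way to resolve several pixels falling into one voxel; the abstract does not say how the authors handle that case.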

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2606.26410 [cs.CV]
	(or arXiv:2606.26410v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2606.26410 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zican Wang
[v1] Wed, 24 Jun 2026 22:06:35 UTC (27,859 KB)

Originally published at arxiv.org
