Nano World Models: A Minimalist Implementation of Future Video Prediction
Quick Take
Nano World Models is a minimalist codebase for future video prediction centered on diffusion forcing. It offers a unified interface for generative objectives, model scales, and action-conditioning mechanisms, and it supports controlled studies of how design choices such as architecture scale, action injection, and sampling budget affect video prediction quality across simple control, game, and real-robot domains.
Key Points
- Nano World Models aims for compact, reproducible, and extensible implementations.
- The framework supports various generative objectives and model scales.
- Experiments span simple control environments, game simulation, and real-robot data.
- Factors studied include prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity.
- Code, configurations, and pretrained checkpoints are publicly released for research.
Article Content
From source RSS / original summary
arXiv:2605.23993v1 Announce Type: new
Abstract: World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models.
We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations.
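As a rough illustration of the idea behind diffusion forcing, the sketch below corrupts each frame of a sequence with its own independently sampled noise level, rather than one shared level for the whole clip. It is not taken from the Nano World Models codebase: the function names `alpha_bar` and `noise_frames`, the cosine-style schedule, and the choice of the added noise as the training target are all assumptions made for this example.

```python
import math
import random


def alpha_bar(k, num_levels):
    """Hypothetical cosine-style signal level: 1.0 at k=0 (clean), ~0.0 at k=num_levels."""
    return math.cos(0.5 * math.pi * k / num_levels) ** 2


def noise_frames(frames, num_levels, rng):
    """Corrupt every frame with an independently sampled noise level.

    frames: list of frames, each a flat list of floats (an assumed toy format).
    Returns (noisy_frames, levels, targets), where targets holds the Gaussian
    noise added to each frame (an epsilon-prediction target, assumed here).
    """
    noisy, levels, targets = [], [], []
    for frame in frames:
        k = rng.randint(0, num_levels)  # per-frame level, both ends inclusive
        a = alpha_bar(k, num_levels)
        eps = [rng.gauss(0.0, 1.0) for _ in frame]
        mixed = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
                 for x, e in zip(frame, eps)]
        noisy.append(mixed)
        levels.append(k)
        targets.append(eps)
    return noisy, levels, targets


if __name__ == "__main__":
    rng = random.Random(0)
    clip = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three toy "frames"
    noisy, levels, _ = noise_frames(clip, num_levels=10, rng=rng)
    print("per-frame noise levels:", levels)
```

A denoiser trained on such inputs would receive the noisy frames together with the per-frame levels. Because the levels differ across time, the same model can in principle be sampled with clean past frames and noisy future frames.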
Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.
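The autoregressive rollout behavior mentioned above can be pictured with a short sketch: a model predicts one frame at a time from a sliding window of recent frames plus the current action, and each prediction is fed back in as context. The `rollout` function, the `predict_next` callable, and the `window` parameter are hypothetical names chosen for this example, not the interface of the released code.

```python
def rollout(predict_next, context, actions, window):
    """Autoregressively generate one frame per action.

    predict_next: hypothetical callable (recent_frames, action) -> next_frame.
    context: list of initial observed frames (must be non-empty).
    actions: sequence of actions, one per frame to generate.
    window: maximum number of most recent frames passed to the model.
    Returns only the newly generated frames.
    """
    frames = list(context)
    for action in actions:
        recent = frames[-window:]  # sliding context window
        frames.append(predict_next(recent, action))  # feed prediction back in
    return frames[len(context):]


if __name__ == "__main__":
    # Toy stand-in for a learned model: next frame = last frame + action.
    def toy_model(recent, action):
        return recent[-1] + action

    print(rollout(toy_model, context=[0.0], actions=[1.0, 2.0, 3.0], window=2))
```

Because every generated frame becomes input to later steps, prediction errors can compound over long horizons, which is one reason rollout behavior is evaluated separately from single-step prediction quality.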