ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
Quick Answer
ImageWAM presents a novel approach to World Action Models (WAMs) by utilizing image editing instead of video generation, achieving 1/6 the FLOPs and 1/4 the latency of traditional video-based models.
Quick Take
By repurposing image editing rather than video generation, ImageWAM concentrates on action-relevant visual changes. It outperforms standard VLA baselines and matches competitive WAMs without additional policy pretraining in both simulator and real-world experiments.
Key Points
- ImageWAM leverages pretrained image editing models for robot action prediction.
- It reduces FLOPs to 1/6 and latency to 1/4 of those of video-based WAMs.
- The model focuses on action-relevant visual differences rather than irrelevant details.
- ImageWAM outperforms standard VLA baselines in various simulator and real-world tests.
- Attention analysis shows editing caches target task-relevant change regions.
Article Content
From source RSS / original summary
arXiv:2606.19531v1
Abstract: World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction.
These issues raise a simple question: do world action models really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining.
In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matches competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of those of video-based WAMs.
Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
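The conditioning step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a tiny hand-rolled velocity field that cross-attends from action tokens to a cached key/value pair, and integrates it with fixed-step Euler updates from Gaussian noise, which is one common way to sample from a flow-matching model. Every name here (toy_velocity, sample_actions, make_toy_inputs), the random weights, and the dimensions are hypothetical stand-ins chosen for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Scaled dot-product attention from action tokens to cached keys/values."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, values)

def toy_velocity(actions, t, kv_cache, params):
    """Hypothetical velocity field v(a_t, t | cache); a stand-in for an action expert."""
    keys, values = kv_cache
    hidden = matmul(actions, params["w_in"])
    hidden = [[h + t * w for h, w in zip(row, params["w_time"])] for row in hidden]
    context = cross_attend(hidden, keys, values)  # the cache is the only conditioning
    mixed = [[math.tanh(h + c) for h, c in zip(hr, cr)]
             for hr, cr in zip(hidden, context)]
    return matmul(mixed, params["w_out"])

def sample_actions(kv_cache, params, horizon, action_dim, steps=10, seed=0):
    """Integrate the velocity field from noise (t=0) to an action chunk (t=1)."""
    rng = random.Random(seed)
    actions = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / steps
    for i in range(steps):
        velocity = toy_velocity(actions, i * dt, kv_cache, params)
        actions = [[a + dt * v for a, v in zip(ar, vr)]
                   for ar, vr in zip(actions, velocity)]
    return actions

def random_matrix(rng, rows, cols, scale=1.0):
    """Random Gaussian matrix used only to fabricate toy inputs."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def make_toy_inputs(num_ctx=16, hidden=32, action_dim=7, seed=1):
    """Build a fake KV cache and random weights; real caches would come from the editor."""
    rng = random.Random(seed)
    kv_cache = (random_matrix(rng, num_ctx, hidden), random_matrix(rng, num_ctx, hidden))
    params = {
        "w_in": random_matrix(rng, action_dim, hidden, 0.1),
        "w_time": [rng.gauss(0.0, 0.1) for _ in range(hidden)],
        "w_out": random_matrix(rng, hidden, action_dim, 0.1),
    }
    return kv_cache, params

if __name__ == "__main__":
    cache, weights = make_toy_inputs()
    chunk = sample_actions(cache, weights, horizon=8, action_dim=7)
    print(len(chunk), len(chunk[0]))
```

The point of the sketch is structural: no target frame is ever decoded, and the sampler reads only the cached keys and values, so swapping in a different cache changes the predicted action chunk while everything else stays fixed.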