WAM4D: Fast 4D World Action Model via Spatial Register Tokens

arXiv cs.CV·Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi, Chengyu Bai, Qianpu Sun, Jiajun Li, Xiaojie Zhang, Jian Tang, Sirui Han, Shanghang Zhang

6/15/2026

Quick Answer

WAM4D introduces a fast 4D world action model that leverages spatial register tokens for efficient causal action generation, enhancing spatial consistency and action prediction in real-world tasks.

Quick Take

WAM4D uses spatial register tokens as training-time future-depth readouts that transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch at inference. According to the abstract, it improves spatial consistency and achieves competitive action prediction on RoboTwin 2.0 and real-world manipulation tasks while keeping inference efficient.

Key Points

WAM4D uses lightweight spatial register tokens as training-time future-depth readouts.
The model improves spatial consistency and achieves competitive action prediction in real-world manipulation tasks.
Causal mixture attention is designed to prevent non-causal shortcuts.
WAM4D shows competitive performance on RoboTwin 2.0 benchmark.
Efficient inference is maintained by removing the register branch during action inference.

Article Content

From source RSS / original summary

arXiv:2606.14048v1

Abstract: World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation.

While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation.

To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens.

Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.
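The abstract does not spell out the visibility rules behind causal mixture attention, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes one video, one action, and one geometry (register) token per timestep, a hypothetical `VISIBLE` table saying which modalities each query may read, and a toy single-head attention over scalar values. Under those assumed rules no video or action token ever attends to a geometry token, which is one way a register branch could be dropped at inference without changing the video and action outputs.

```python
import math
import random

VIDEO, ACTION, GEOM = "video", "action", "geom"

# Assumed visibility rules (NOT taken from the paper): for each query modality,
# the set of key modalities it may attend to. Geometry tokens read video
# tokens, but no video or action query reads geometry tokens.
VISIBLE = {
    VIDEO: {VIDEO},
    ACTION: {VIDEO, ACTION},
    GEOM: {VIDEO, GEOM},
}


def build_tokens(num_steps, with_geom=True):
    """Return (timestep, modality) pairs in temporal order."""
    kinds = [VIDEO, ACTION] + ([GEOM] if with_geom else [])
    return [(t, m) for t in range(num_steps) for m in kinds]


def build_mask(tokens):
    """mask[q][k] is True when query q may attend to key k.

    A key is visible only if it is not from a future timestep (causality)
    and its modality is allowed for the query's modality.
    """
    return [
        [tk <= tq and mk in VISIBLE[mq] for (tk, mk) in tokens]
        for (tq, mq) in tokens
    ]


def attend(values, scores, mask):
    """Toy masked softmax attention over scalar values (single head)."""
    out = []
    for q, row in enumerate(mask):
        weights = [
            math.exp(scores[q][k]) if allowed else 0.0
            for k, allowed in enumerate(row)
        ]
        total = sum(weights)  # never zero: every token can see itself
        out.append(sum(w * v for w, v in zip(weights, values)) / total)
    return out


def drop_geom_demo(num_steps=3, seed=0):
    """Compare video/action outputs with and without geometry tokens."""
    rng = random.Random(seed)
    full = build_tokens(num_steps, with_geom=True)
    n = len(full)
    values = [rng.uniform(-1, 1) for _ in range(n)]
    scores = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    out_full = attend(values, scores, build_mask(full))

    # Remove the geometry tokens, reusing the same values and scores.
    keep = [i for i, (_, m) in enumerate(full) if m != GEOM]
    small = [full[i] for i in keep]
    out_small = attend(
        [values[i] for i in keep],
        [[scores[q][k] for k in keep] for q in keep],
        build_mask(small),
    )
    return [out_full[i] for i in keep], out_small


if __name__ == "__main__":
    with_geom, without_geom = drop_geom_demo()
    same = all(math.isclose(a, b) for a, b in zip(with_geom, without_geom))
    print("video/action outputs unchanged after dropping geometry tokens:", same)
```

The two lists returned by `drop_geom_demo()` agree up to floating-point tolerance because the masked geometry keys contribute zero weight to every video and action query. A different visibility table, for example one where action tokens read geometry tokens, would break that property, so the choice of rules matters for whether the branch can be removed cheaply.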
