WAM4D: Fast 4D World Action Model via Spatial Register Tokens
Quick Answer
WAM4D is a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts, then removes the register branch so causal action generation stays efficient. It improves spatial consistency while keeping action prediction competitive.
Quick Take
On RoboTwin 2.0 and challenging real-world manipulation tasks, WAM4D reports improved spatial consistency and competitive action prediction while keeping inference lightweight.
Key Points
- WAM4D uses lightweight spatial register tokens as training-time future-depth readouts.
- The model improves spatial consistency on RoboTwin 2.0 and real-world manipulation tasks.
- Causal mixture attention defines modality-specific visibility among video, action, and geometry tokens to prevent non-causal shortcuts.
- WAM4D achieves competitive action prediction on the RoboTwin 2.0 benchmark.
- Efficient inference is maintained by removing the register branch during action inference.
Article Content
From source RSS / original summary: arXiv:2606.14048v1, Announce Type: new
Abstract: World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation.
While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation.
To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens.
Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.
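The abstract does not spell out the visibility rules of causal mixture attention, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It builds a boolean attention mask over video, action, and geometry (register) tokens in which no token sees a future timestep and no video or action token reads a geometry token. The rule table `VISIBLE`, the token layout, and the helper names `build_mask` and `drop_modality` are all hypothetical. Under these assumed rules, deleting the geometry tokens leaves the video and action part of the mask unchanged, which is one way a register branch could be removed at inference without affecting action generation.

```python
# Hypothetical sketch of a modality-aware causal attention mask.
# The visibility rules are illustrative assumptions, not taken from the paper.

VIDEO, ACTION, GEOMETRY = "video", "action", "geometry"

# Assumed rule table: for each query modality, the key modalities it may read.
# Geometry (register) tokens read video tokens, but nothing reads geometry.
VISIBLE = {
    VIDEO: {VIDEO},
    ACTION: {VIDEO, ACTION},
    GEOMETRY: {VIDEO, GEOMETRY},
}


def build_mask(tokens):
    """Return mask[q][k] == True when query token q may attend to key token k.

    tokens is a list of (modality, timestep) pairs.
    """
    mask = []
    for q_mod, q_t in tokens:
        row = []
        for k_mod, k_t in tokens:
            causal = k_t <= q_t                 # never look at a future timestep
            allowed = k_mod in VISIBLE[q_mod]   # modality-specific visibility
            row.append(causal and allowed)
        mask.append(row)
    return mask


def drop_modality(tokens, mask, modality):
    """Remove every token of one modality from both the sequence and the mask."""
    keep = [i for i, (mod, _) in enumerate(tokens) if mod != modality]
    kept_tokens = [tokens[i] for i in keep]
    kept_mask = [[mask[i][j] for j in keep] for i in keep]
    return kept_tokens, kept_mask


if __name__ == "__main__":
    # Three timesteps, each holding one video, one geometry, and one action token.
    tokens = [(m, t) for t in range(3) for m in (VIDEO, GEOMETRY, ACTION)]
    train_mask = build_mask(tokens)
    infer_tokens, infer_mask = drop_modality(tokens, train_mask, GEOMETRY)
    # With the assumed rules, the pruned mask matches one built without registers.
    print(infer_mask == build_mask(infer_tokens))
```

The one-way visibility is the design choice doing the work here: because geometry tokens only read and are never read, they can shape training through their own loss without becoming an input that the action tokens depend on.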