Diffusion Transformer World-Action Model for AV Scene Prediction
Quick Answer
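This paper shows that a compact latent Diffusion Transformer (DiT) world model predicts future camera scenes for autonomous vehicles far more realistically than a regression baseline, reaching a KID of 0.078 versus 0.375. In a separate encoder benchmark, V-JEPA2 with temporal context reduces steering RMSE by 40% relative to the best single-frame encoder. The DiT is also action-controllable, and a calibration derived from training data makes it usable without test-time ground truth.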
Key Points
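- A compact latent world model predicts future scene latents from the present front-camera latent and a sequence of ego-actions.
- V-JEPA2 with temporal context achieves 40% lower steering RMSE than the best single-frame encoder.
- The diffusion model attains a KID of 0.078 versus 0.375 for regression (4.8x better).
- Action-controllability is validated by a Spearman correlation of 0.81 between steering and scene displacement, versus -0.18 for regression.
- A compact 1.7M-parameter "jump" model recovers full ground-truth motion magnitude (1.02x GT), where single-pass models capture less than half.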
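The action-controllability figure in the list above is a Spearman rank correlation. As a minimal sketch of how such a score could be computed, the snippet below ranks two series and takes the Pearson correlation of the ranks. The `steering` and `displacement` values are hypothetical illustration data, not results from the paper, and the helper names are our own.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: commanded steering and measured horizontal scene shift.
steering = [-0.30, -0.10, 0.00, 0.15, 0.40]
displacement = [-4.0, -0.5, -1.0, 2.5, 9.0]
rho = spearman_rho(steering, displacement)  # one swapped pair of ranks
```

A value near 1 means larger steering commands consistently produce larger scene displacement. A value near 0 or below, such as the -0.18 reported for regression, means the predicted scene does not follow the command.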
Article Content
From the source RSS summary (arXiv:2606.12987v1).
Abstract: Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction.
We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder.
We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution.
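The tension between distortion metrics and realism can be illustrated with a toy calculation. This is a minimal sketch under strong assumptions, not the paper's evaluation: the "scene" is a single hypothetical scalar and the future has two equally likely outcomes in the `futures` list. It compares the mean squared error of the regression mean against that of a sampler that draws from the true outcome distribution.

```python
from itertools import product

# Hypothetical ambiguous future: the scene shifts left (-1.0) or right (+1.0)
# with equal probability.
futures = [-1.0, 1.0]


def mse(pred, target):
    return (pred - target) ** 2


# Regression converges to the conditional mean, which is neither outcome.
mean_pred = sum(futures) / len(futures)
mse_mean = sum(mse(mean_pred, t) for t in futures) / len(futures)

# A perfect sampler emits a real outcome, independent of which one occurs,
# so average the error over all (sample, target) pairs.
pairs = list(product(futures, futures))
mse_sample = sum(mse(s, t) for s, t in pairs) / len(pairs)

print(f"regression mean: pred={mean_pred}, MSE={mse_mean}")
print(f"true-distribution sampler: MSE={mse_sample}")
```

In this toy setting the mean prediction has half the error of the sampler, even though the mean (0.0) is a value that never occurs in the data. That is the sense in which a distortion metric can favor a blurry prediction over a realistic one.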
Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.
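For readers unfamiliar with KID, it is commonly defined as an unbiased estimate of the squared maximum mean discrepancy between two feature sets under the cubic polynomial kernel $k(x, y) = (x \cdot y / d + 1)^3$. The sketch below applies that estimator to hypothetical 2-D toy features in place of Inception features, and it omits the subset averaging that standard implementations use. The data generators and all names are assumptions for illustration; nothing here reproduces the paper's numbers.

```python
import random


def poly_kernel(x, y):
    """Cubic polynomial kernel (x.y / d + 1)^3, with d the feature dimension."""
    d = len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return (dot / d + 1.0) ** 3


def kid_estimate(real, fake):
    """Unbiased MMD^2 estimate; within-set sums exclude the i == j terms."""
    m, n = len(real), len(fake)
    k_rr = sum(poly_kernel(real[i], real[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_ff = sum(poly_kernel(fake[i], fake[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_rf = sum(poly_kernel(r, f) for r in real for f in fake) / (m * n)
    return k_rr + k_ff - 2.0 * k_rf


rng = random.Random(0)  # fixed seed so the toy data is reproducible


def bimodal(n):
    """Toy 'realistic' features: two modes near x = -1 and x = +1."""
    return [[(1.0 if i % 2 == 0 else -1.0) + rng.gauss(0, 0.1),
             rng.gauss(0, 0.1)] for i in range(n)]


def blurry(n):
    """Toy 'regression mean' features: collapsed near the origin."""
    return [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(n)]


real = bimodal(200)
kid_sharp = kid_estimate(real, bimodal(200))   # same distribution as real
kid_blurry = kid_estimate(real, blurry(200))   # collapsed to the mean
print(f"sharp: {kid_sharp:.3f}  blurry: {kid_blurry:.3f}")
```

Lower is better. Because the estimator is unbiased, samples drawn from the same distribution give a value near zero that can be slightly negative, while the collapsed set scores clearly higher.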