Diffusion Transformer World-Action Model for AV Scene Prediction | AI Deep Signal

Diffusion Transformer World-Action Model for AV Scene Prediction

arXiv cs.CV·Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

6/12/2026


Quick Answer

This paper presents a compact Diffusion Transformer (DiT) latent world model that predicts future camera scenes for autonomous vehicles from planned ego-actions. It reports 40% lower steering RMSE than the best single-frame encoder and a KID of 0.078, versus 0.375 for a regression baseline. These results are cited as evidence of stronger action-controllability and of usability in deployment without ground truth.

Key Points

The DiT-based latent world model predicts future scenes conditioned on ego-actions.
It achieves 40% lower steering RMSE than the best single-frame encoder.
The diffusion model reaches a KID of 0.078 versus 0.375 for regression, roughly 4.8x lower (lower KID is better).
Action-controllability is validated with a Spearman correlation of 0.81.
The compact 1.7M-parameter model recovers the full ground-truth motion magnitude.
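
As a rough illustration of the Spearman check mentioned above: rank correlation asks whether larger commanded actions produce monotonically larger predicted motion, regardless of scale. The sketch below is not the paper's evaluation code. The function names and the toy numbers are hypothetical and chosen only to show the computation.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: commanded steering vs. motion measured in predictions.
commanded_steering = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
predicted_motion = [0.02, 0.05, 0.04, 0.20, 0.35, 0.30]

rho = spearman(commanded_steering, predicted_motion)
print(f"Spearman rho = {rho:.3f}")  # about 0.886 for this toy data
```

A value near 1 means the predicted motion follows the ordering of the commanded actions, which is the sense in which a score such as 0.81 supports action-controllability.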

Paper Resources

Read Paper (arxiv.org) | View PDF (arxiv.org)

Source Excerpt

arXiv:2606.12987v1. Announce Type: new.

Abstract: Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction.

We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. …
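
The abstract's point that distortion metrics reward a blurry regression mean can be seen in a toy calculation. The sketch below assumes a hypothetical one-pixel scene with two equally likely futures (0.0 and 1.0); it does not reproduce the paper's metrics or data. It compares the expected squared error of always predicting the average against that of predicting a realistic sample drawn from the true futures.

```python
from itertools import product

# Hypothetical ambiguous future: one pixel that ends up 0.0 or 1.0 with equal probability.
futures = [0.0, 1.0]


def expected_mse(predictions):
    """Mean squared error over all (prediction, true future) pairs, each equally likely."""
    pairs = list(product(predictions, futures))
    return sum((p - f) ** 2 for p, f in pairs) / len(pairs)


# A regression model trained with MSE converges to the average of the futures.
blurry_mean = [sum(futures) / len(futures)]

mse_mean = expected_mse(blurry_mean)   # always predict 0.5
mse_samples = expected_mse(futures)    # predict a realistic future, chosen at random

print(mse_mean, mse_samples)  # 0.25 0.5
```

The averaged prediction scores twice as well on squared error even though 0.5 is never an actual outcome, which is why a distribution-level metric such as KID can rank models differently from a pixel-distortion metric.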

