ReWorld: Learning Better Representations for World Action Models

arXiv cs.CV·Tianze Xia, Lijun Zhou, Kaixin Xiong, Jingfeng Yao, Yu Zhu, Zhenxin Zhu, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Haiyang Sun, Xinggang Wang

Quick Take

ReWorld is a representation learning framework for World Action Models (WAMs) in autonomous driving. It lowers the FVD of fine-tuned video generation by 23.9% (81.3 to 61.9; lower is better) and raises closed-loop PDMS from 89.1 to 90.4 without post-training such as RL or post-processing. Rather than supervising only the module outputs, the framework optimizes intermediate representations directly, and from-scratch convergence is approximately 2x faster in experiments on nuScenes and NAVSIM.

Key Points

ReWorld is presented as the first representation learning framework designed specifically for autonomous-driving WAMs.
FVD for fine-tuned video generation drops by 23.9%, from 81.3 to 61.9.
Closed-loop PDMS rises from 89.1 to 90.4 without any post-training such as RL or post-processing.
From-scratch convergence is approximately 2x faster in experiments on nuScenes and NAVSIM.
Focuses on optimizing intermediate representations for better planning.
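
The headline figures above can be checked with simple arithmetic. The short Python sketch below recomputes the relative FVD reduction and the absolute PDMS gain from the reported before/after values; the helper name relative_reduction is illustrative and does not come from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which a lower-is-better metric decreased."""
    return (before - after) / before * 100.0


# Values as reported in the abstract.
fvd_drop = relative_reduction(81.3, 61.9)   # FVD: lower is better
pdms_gain = 90.4 - 89.1                     # PDMS: higher is better

print(f"FVD relative reduction: {fvd_drop:.1f}%")
print(f"PDMS absolute gain: {pdms_gain:.1f} points")
```

Note that the 23.9% figure is a relative change in FVD, while the PDMS improvement is quoted as an absolute change in score.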

[Submitted on 25 Jun 2026]

Authors: Tianze Xia, Lijun Zhou, Kaixin Xiong, Jingfeng Yao, Yu Zhu, Zhenxin Zhu, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Haiyang Sun, Xinggang Wang

Abstract: World Action Models (WAMs) model future environment evolution under action conditioning, offering a scalable paradigm for autonomous driving. However, existing approaches focus largely on model architecture design, and how a WAM can efficiently learn better world representations for planning remains underexplored. To address this gap, we propose ReWorld, the first representation learning framework specifically designed for autonomous-driving world action models. In WAMs, standard training supervises only the output ends of the generation and planning modules, leaving the intermediate representations that carry world knowledge to be shaped only indirectly, as byproducts of fitting these outputs. The core idea of ReWorld is to treat intermediate representations as direct targets of optimization, shaping them along three complementary dimensions. On the Video DiT responsible for generation, we impose future-predictive supervision on its intermediate representations. On the Action DiT responsible for planning, we first align its intermediate representations cross-modally with the video world representation, then further shape them to be discriminative around safety-critical boundaries via hard-negative supervision. In addition, we systematically analyze the effectiveness of existing representation learning methods in video generation world models, and discuss why their performance is limited on this task. Experiments on nuScenes and NAVSIM show that ReWorld improves fine-tuned video generation by 23.9% in FVD (81.3 to 61.9), raises closed-loop PDMS from 89.1 to 90.4 without any post-training such as RL or post-processing, and accelerates from-scratch convergence by approximately 2x.
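
The abstract names three kinds of supervision on intermediate representations but does not give their loss formulas. The Python sketch below shows one generic way such terms could be written, purely as an illustration: a mean-squared error stands in for the future-predictive term, a cosine distance for the cross-modal alignment term, and a hinge (margin) loss for the hard-negative term. Every function name, the margin value, the toy vectors, and the equal weighting are assumptions and not the authors' method.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)


def future_predictive_loss(predicted, future_target):
    """Assumed stand-in: MSE between features predicted from an
    intermediate video representation and a future-frame target."""
    return sum((p - t) ** 2 for p, t in zip(predicted, future_target)) / len(predicted)


def alignment_loss(action_feat, video_feat):
    """Assumed stand-in: cosine distance pulling the action-side
    feature toward the video-side world representation."""
    return 1.0 - cosine(action_feat, video_feat)


def hard_negative_loss(anchor, safe, unsafe, margin=0.2):
    """Assumed stand-in: hinge loss that wants the anchor closer to a
    safe example than to a near-miss unsafe one, by at least `margin`."""
    return max(0.0, margin - cosine(anchor, safe) + cosine(anchor, unsafe))


# Toy vectors, chosen only to exercise the functions.
video_pred, future_tgt = [0.5, 1.0, -0.5], [0.5, 0.8, -0.1]
action_feat, video_feat = [1.0, 0.2, 0.0], [0.9, 0.1, 0.1]
safe_plan, unsafe_plan = [1.0, 0.3, 0.0], [0.8, -0.4, 0.2]

# Equal weights are an arbitrary choice for this sketch.
total = (
    future_predictive_loss(video_pred, future_tgt)
    + alignment_loss(action_feat, video_feat)
    + hard_negative_loss(action_feat, safe_plan, unsafe_plan)
)
print(f"illustrative auxiliary loss: {total:.4f}")
```

In a setup like this, the auxiliary terms would be added to the usual output-level generation and planning losses, so the intermediate features receive a direct training signal instead of being shaped only as a byproduct of fitting the outputs.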

Comments:	19 pages, 3 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2606.27504 [cs.CV]
	(or arXiv:2606.27504v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2606.27504 arXiv-issued DOI via DataCite

Submission history

From: Tianze Xia
[v1] Thu, 25 Jun 2026 19:37:58 UTC (923 KB)

— Originally published at arxiv.org
