BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

arXiv cs.CV·Shaohao Rui, Xiaofeng Mao, Zhanyu Zhang, Peijia Lin, Yansong Zhu, Yibo Zhang, Haibin Wan, Weijie Ma

·~2 min·6/10/2026·en·0

Quick Take

BiWM is a bidirectional autoregressive framework for interactive video world models that jointly optimizes generation quality and inference speed while cutting training from four stages to two. A single recipe covers backbones from Wan2.1-1.3B to LTX-2.3-22B, and it retains real-world camera control in settings where minWM, which supports only causal models, loses controllability.

Key Points

BiWM trains a controllable world model in two stages (camera-control fine-tuning, then few-step DMD) instead of the four used by minWM.
A single recipe covers Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B.
Pluggable history compression (FramePack-style and PackForcing-style) supports long rollouts.
Open-sourced for resource-constrained research and high-fidelity environment simulation.
GAN and mass-covering forward-KL objectives counter DMD's mode-seeking degradation and preserve scene dynamics.

Article Content

From source RSS / original summary

arXiv:2606.10135v1 Announce Type: new Abstract: Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models.

We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed.

From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models.

BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.
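The contrast between "mode-seeking" and "mass-covering" objectives mentioned above can be illustrated with a toy example. The sketch below is not BiWM code and makes several assumptions for illustration: the target is a 1D two-mode mixture, the model is a single Gaussian, and the fit is a plain grid search (the names `fit`, `kl`, and `normalize_log` are hypothetical). Reverse KL, the direction DMD-style distillation is usually described as minimizing, tends to lock onto one mode, while forward KL spreads over both.

```python
import math

def normalize_log(log_w):
    """Turn unnormalized log-weights into log-probabilities on the grid."""
    m = max(log_w)
    log_z = m + math.log(sum(math.exp(v - m) for v in log_w))
    return [v - log_z for v in log_w]

def kl(log_a, log_b):
    """Discrete KL(a || b) computed from log-probabilities on a shared grid."""
    return sum(math.exp(la) * (la - lb) for la, lb in zip(log_a, log_b))

# Shared grid on [-10, 10] with step 0.05.
XS = [-10 + 0.05 * i for i in range(401)]

def log_mixture(x):
    """Unnormalized log-density of an equal mixture of N(-3, 1) and N(3, 1)."""
    a, b = -0.5 * (x + 3) ** 2, -0.5 * (x - 3) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# Toy bimodal "data" distribution p (an assumption, not from the paper).
LOG_P = normalize_log([log_mixture(x) for x in XS])

def fit(direction):
    """Grid-search a single Gaussian q under forward or reverse KL.

    direction="forward" minimizes KL(p || q)  (mass-covering).
    direction="reverse" minimizes KL(q || p)  (mode-seeking).
    Returns the best (mu, sigma) found on the grid.
    """
    best = None
    for i in range(41):                 # mu in [-4, 4], step 0.2
        mu = -4 + 0.2 * i
        for j in range(15):             # sigma in [0.5, 4.0], step 0.25
            sigma = 0.5 + 0.25 * j
            log_q = normalize_log([-0.5 * ((x - mu) / sigma) ** 2 for x in XS])
            if direction == "forward":
                loss = kl(LOG_P, log_q)
            else:
                loss = kl(log_q, LOG_P)
            if best is None or loss < best[0]:
                best = (loss, mu, sigma)
    return best[1], best[2]

if __name__ == "__main__":
    print("forward KL fit (mu, sigma):", fit("forward"))
    print("reverse KL fit (mu, sigma):", fit("reverse"))
```

In this toy setting the forward-KL fit sits between the two modes with a wide variance, whereas the reverse-KL fit collapses onto a single mode with a narrow variance. That collapse is the kind of lost diversity the added forward-KL term is meant to counteract; how BiWM weights or implements these terms is not specified in the abstract.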
