BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression
Quick Answer
BiWM introduces a bidirectional autoregressive framework for interactive video world models that jointly optimizes generation quality and inference speed while cutting training from four stages to two.
Quick Take
BiWM applies a single training recipe to several pretrained video backbones and enables real-world camera control where minWM loses controllability. It also fills a gap in open-source tooling: existing frameworks such as minWM support only causal models.
Key Points
- BiWM builds a controllable world model in two training stages (camera-control fine-tuning, then few-step DMD) instead of the four used by minWM.
- A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B.
- Integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts.
- Open-sourced for resource-constrained research and high-fidelity environment simulation.
- Adds GAN and mass-covering forward-KL objectives to counter DMD's mode-seeking degradation and preserve scene dynamics.
Article Content
From source RSS / original summary: arXiv:2606.10135v1, Announce Type: new. Abstract: Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation.
Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed.
From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models.
BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.
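The contrast between a mode-seeking objective and a mass-covering forward-KL objective can be seen in a toy 1-D setting. The sketch below is not BiWM's training code and says nothing about its actual loss weights or architecture. It assumes that the mode-seeking behavior attributed to DMD corresponds to a reverse-KL-style objective, which is how DMD is usually described. The bimodal target, the single-Gaussian model, and every name in the code (`forward_kl`, `reverse_kl`, `best_fit`) are illustrative assumptions.

```python
import math

def log_normal(x, mu, sigma):
    """Log density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

# Integration grid on [-8, 8] with spacing DX (hypothetical toy setup).
DX = 0.02
XS = [-8.0 + DX * i for i in range(801)]

# Toy "data" distribution p: two equally weighted modes at -3 and +3.
LOG_P = [
    math.log(0.5 * math.exp(log_normal(x, -3.0, 0.5))
             + 0.5 * math.exp(log_normal(x, 3.0, 0.5)))
    for x in XS
]

def forward_kl(mu, sigma):
    """KL(p || q): averaged under the data distribution p (mass-covering)."""
    return sum(
        math.exp(lp) * (lp - log_normal(x, mu, sigma)) * DX
        for x, lp in zip(XS, LOG_P)
    )

def reverse_kl(mu, sigma):
    """KL(q || p): averaged under the model q (mode-seeking)."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal(x, mu, sigma)
        # exp(lq) may underflow to 0.0 far from mu; that term then contributes 0.
        total += math.exp(lq) * (lq - lp) * DX
    return total

def best_fit(divergence):
    """Grid-search the single Gaussian q = N(mu, sigma^2) that minimizes `divergence`."""
    candidates = [
        (-4.0 + 0.25 * i, 0.25 * j) for i in range(33) for j in range(1, 17)
    ]
    return min(candidates, key=lambda ms: divergence(*ms))

reverse_fit = best_fit(reverse_kl)
forward_fit = best_fit(forward_kl)
print("reverse KL fit (mu, sigma):", reverse_fit)
print("forward KL fit (mu, sigma):", forward_fit)
```

In this toy setup the reverse-KL fit should collapse onto one of the two modes with a narrow width, while the forward-KL fit should center between them with a wide spread so that both modes receive probability mass. That is the intuition behind adding a mass-covering term next to a mode-seeking one. How BiWM balances the two, and how its GAN objective interacts with them, is not specified in the abstract.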