What-If World: A Causal Benchmark for General World Models in Embodied Scenarios
Quick Take
The What-If World benchmark uses 319 prompt pairs to test whether video generation models respond correctly when a single physical variable in the prompt changes. None of the nine models tested exceeds 52% on the paired score, and open-source models cluster near 28%, which suggests current models cannot yet reliably support action-conditioned simulation.
Key Points
- What-If World includes 319 prompt pairs based on real frames from nuScenes and DROID.
- Scored with the four-part APEO rubric (Adherence, Physics, Environment, Outcome), no model exceeds 52% on the paired score.
- Open-source models clustered around 28%, indicating substantial room for improvement.
- Where models score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics.
- Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.
Article Content
Source: arXiv:2605.27589v1
Abstract: Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model's output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts.
The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation.
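To make the pair construction concrete, here is a minimal Python sketch of how one such test case might be represented. The schema, the field names, the file path, the "mass" label, and the two example prompts are all hypothetical illustrations; the abstract does not describe the benchmark's actual data format or name its six variables.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """One counterfactual test case (hypothetical schema, not the benchmark's own)."""

    frame_path: str  # shared initial frame, e.g. taken from nuScenes or DROID
    variable: str    # the single physical variable that differs (label is illustrative)
    prompt_a: str
    prompt_b: str

    def differing_words(self) -> tuple[set[str], set[str]]:
        """Return the words unique to each prompt: a crude check that the edit is small."""
        a = set(self.prompt_a.lower().split())
        b = set(self.prompt_b.lower().split())
        return a - b, b - a


# Invented example: the wording differs by one word, the expected motion differs a lot.
pair = PromptPair(
    frame_path="frames/example.png",
    variable="mass",
    prompt_a="The robot arm pushes the empty box across the table.",
    prompt_b="The robot arm pushes the heavy box across the table.",
)
only_a, only_b = pair.differing_words()
```

The word-level diff only illustrates the point made above: the textual change is tiny, while the physically correct difference between the two videos is not.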
Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%.
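The rubric can be sketched in a few lines of Python. The abstract names the four checks but does not say how they are combined, so the rule below (a pair passes only if all four checks pass, and the paired score is the percentage of passing pairs) is an assumption, as are the class and function names and the sample judgments.

```python
from dataclasses import dataclass


@dataclass
class APEOResult:
    """Judgments for one prompt pair (hypothetical container)."""

    adherence: bool    # each video follows its own prompt
    physics: bool      # each video is physically consistent
    environment: bool  # the shared scene is preserved
    outcome: bool      # the pair ends in the correct difference

    def passed(self) -> bool:
        # ASSUMPTION: all four checks must hold for the pair to count.
        return self.adherence and self.physics and self.environment and self.outcome


def paired_score(results: list[APEOResult]) -> float:
    """Percentage of pairs that pass every check (assumed aggregation rule)."""
    if not results:
        raise ValueError("need at least one scored pair")
    return 100.0 * sum(r.passed() for r in results) / len(results)


# Invented judgments for four pairs; two of them pass all four checks.
results = [
    APEOResult(True, True, True, True),
    APEOResult(True, True, True, False),  # plausible videos, wrong final difference
    APEOResult(True, False, True, True),  # physics violated in at least one video
    APEOResult(True, True, True, True),
]
score = paired_score(results)
```

Under this assumed rule, the second pair shows the failure the benchmark targets: both videos look acceptable on their own, yet the pair fails because the outcome does not differ the way physics predicts.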
Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.
