MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
Quick Take
MemoBench is a diagnostic benchmark for memory consistency in video generation models, built around the disappear-and-reappear paradigm in dynamically changing environments. It comprises 360 ground-truth clips spanning synthetic and real-world scenes, and its evaluation of eight state-of-the-art models surfaces open challenges in keeping memory consistent when the scene changes while the target is out of view.
Key Points
- MemoBench evaluates memory consistency in dynamically changing environments.
- The benchmark includes 360 ground-truth clips from synthetic and real-world scenes.
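- Evaluation follows a disappear-and-reappear paradigm: a target object undergoes a physical process, disappears from view, and must be recovered in its updated state when it reappears.
- Eight state-of-the-art video generation models were evaluated, exposing open challenges in memory consistency.
- The evaluation suite combines automated metrics with VQA-based assessment across four diagnostic pillars.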
Authors: Haoyu Chen, Kaichen Zhou, Hang Hua, Kaile Zhang, Jingwen Qian, Wufei Ma, Haonan Chen, Chunjiang Liu, Yizhou Zhao, Xiaoyuan Wang, Weiyue Li, Alan Yuille, Paul Pu Liang, Yilun Du
Abstract: Video generation models aspire to simulate dynamic environments, and several benchmarks now evaluate memory consistency across frames. However, most assess consistency only while the target remains in view, and the few that force objects out of view evaluate static scenes where nothing changes during occlusion. To bridge this gap, we introduce MemoBench, a diagnostic benchmark built around the disappear-and-reappear paradigm in dynamically changing environments: a target object undergoes a physical process, disappears from view, and must be correctly recovered in its updated state upon reappearance. We curate 360 ground-truth clips spanning synthetic and real-world scenes, and design an evaluation suite combining automated metrics with VQA-based assessment across four diagnostic pillars. Evaluation of eight state-of-the-art models reveals key insights and open challenges regarding memory consistency under the disappear-and-reappear paradigm.
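To make the disappear-and-reappear paradigm concrete, here is a minimal Python sketch of the kind of check it implies. This is not MemoBench's actual metric or code: the `Frame` type, the scalar `state` (for example, the fraction of an ice cube that has melted), the tolerance `tol`, and both function names are hypothetical, chosen only for illustration. The sketch assumes the generated and reference clips have the same number of frames.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    visible: bool           # is the target object in view in this frame?
    state: Optional[float]  # hypothetical scalar state; None while hidden


def reappearance_index(frames: List[Frame]) -> int:
    """Return the index of the first visible frame after an occluded stretch."""
    seen_visible = False
    seen_hidden = False
    for i, frame in enumerate(frames):
        if frame.visible and not seen_hidden:
            seen_visible = True
        elif not frame.visible and seen_visible:
            seen_hidden = True
        elif frame.visible and seen_hidden:
            return i
    raise ValueError("clip has no disappear-and-reappear event")


def classify_recovery(generated: List[Frame], reference: List[Frame],
                      tol: float = 0.1) -> str:
    """Label the generated state at the moment the object reappears.

    'updated'      : matches the reference state after the hidden change
    'stale'        : matches the last state seen before the object vanished
    'missing'      : the object did not come back at all
    'inconsistent' : none of the above
    """
    k = reappearance_index(reference)
    last_seen = max(i for i in range(k) if reference[i].visible)
    updated = reference[k].state
    stale = reference[last_seen].state
    got = generated[k].state
    if got is None:
        return "missing"
    if abs(got - updated) <= tol:
        return "updated"
    if abs(got - stale) <= tol:
        return "stale"
    return "inconsistent"


# Toy reference clip: the state keeps changing while the object is hidden.
reference = [Frame(True, 0.0), Frame(True, 0.2),
             Frame(False, None), Frame(False, None), Frame(True, 0.8)]
remembers_change = reference[:4] + [Frame(True, 0.75)]
frozen_in_time = reference[:4] + [Frame(True, 0.2)]

print(classify_recovery(remembers_change, reference))
print(classify_recovery(frozen_in_time, reference))
```

The distinction between "updated" and "stale" is the point of the toy: a static-scene benchmark cannot tell the two apart, because nothing changes during occlusion, whereas a dynamically changing scene separates a model that tracks the hidden process from one that simply replays what it last saw.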
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2606.27537 [cs.CV] (or arXiv:2606.27537v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.27537 (arXiv-issued DOI via DataCite)
Submission history
From: Haoyu Chen
[v1] Thu, 25 Jun 2026 20:37:39 UTC (9,565 KB)
— Originally published at arxiv.org