WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

arXiv cs.CV·Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu

6/8/2026

Quick Take

WorldBench is a new multimodal reasoning benchmark designed to evaluate Multimodal Large Language Models (MLLMs) across diverse visual concepts. It reveals significant weaknesses in visual understanding, with the best-performing model achieving only 64.0% accuracy. This highlights the critical need for visual diversity in multimodal benchmarks.

Key Points

WorldBench features a taxonomy of thousands of visual concepts across multiple domains.
On both quantitative and human evaluations, it achieves higher visual diversity than any existing multimodal benchmark.
The authors evaluated 15 MLLMs, revealing weaknesses in visual understanding.
The strongest model reached only 64.0% accuracy, while some models performed only marginally above chance level.
The benchmark emphasizes the necessity of visual diversity in model evaluation.

Article Excerpt

From source RSS / original summary

arXiv:2606.06538v1. Abstract: In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things).

Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform marginally above chance-level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

