Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

arXiv cs.CV·Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari

6/5/2026

Quick Answer

Almieyar-Oryx-BloomBench is a bilingual (English-Arabic) benchmark that evaluates Vision-Language Models (VLMs) across six cognitive levels drawn from Bloom's Taxonomy.

Quick Take

Evaluating state-of-the-art VLMs across the six cognitive levels of Bloom's Taxonomy, the benchmark finds that top models excel at semantic understanding but struggle with factual recall and creative synthesis, and that their performance differs significantly between Arabic and English. The authors present these findings as a foundation for more cognitively aligned and inclusive VLMs.

Key Points

BloomBench evaluates VLMs using six cognitive levels based on Bloom's Taxonomy.
State-of-the-art models show strong semantic understanding but weak factual recall.
A significant performance gap exists between Arabic and English in multimodal reasoning.
The benchmark framework ensures scalability, cultural inclusivity, and linguistic fidelity.
The dataset and framework are available on GitHub for further research.

Article Content

From source RSS / original summary

arXiv:2606.05531v1. Abstract: Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement.

To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks.

Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis.

This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.
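
To make the idea of a per-level "cognitive profile" and an Arabic-English gap concrete, here is a minimal Python sketch of how such numbers could be aggregated from graded answers. It is not the benchmark's own evaluation code: the `Result` record, its field names, the `"en"`/`"ar"` language codes, and the assumption that every answer is scored as simply correct or incorrect are all hypothetical choices made for illustration. Only the six level names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# The six Bloom's Taxonomy levels named in the abstract.
BLOOM_LEVELS = ("Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create")


@dataclass(frozen=True)
class Result:
    """One graded answer. Hypothetical schema, not the dataset's real format."""
    level: str      # one of BLOOM_LEVELS
    language: str   # assumed codes: "en" or "ar"
    correct: bool   # assumed binary grading


def accuracy_by_level(results):
    """Return {language: {level: accuracy}} for every (language, level) seen."""
    counts = defaultdict(lambda: [0, 0])  # (language, level) -> [correct, total]
    for r in results:
        if r.level not in BLOOM_LEVELS:
            raise ValueError(f"unknown cognitive level: {r.level!r}")
        tally = counts[(r.language, r.level)]
        tally[0] += int(r.correct)
        tally[1] += 1

    profile = defaultdict(dict)
    for (language, level), (correct, total) in counts.items():
        profile[language][level] = correct / total
    return {language: dict(levels) for language, levels in profile.items()}


def language_gap(profile, base="en", other="ar"):
    """Per-level accuracy difference (base minus other) for levels present in both."""
    base_scores = profile.get(base, {})
    other_scores = profile.get(other, {})
    return {
        level: base_scores[level] - other_scores[level]
        for level in BLOOM_LEVELS
        if level in base_scores and level in other_scores
    }
```

Given a list of `Result` records for one model, `accuracy_by_level` yields that model's profile across the six levels in each language, and `language_gap` reports where English accuracy exceeds Arabic accuracy (positive values) or trails it (negative values). Open-ended levels such as Create may well need a graded rather than binary score, in which case `correct` would become a float and the tally a running sum.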

