MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

arXiv cs.CL·Manh Luong, Tamas Abraham, Junae Kim, Amar Kaur, Rollin Omari, Gholamreza Haffari, Trang Vu, Lizhen Qu, Dinh Phung

6/5/2026

Quick Answer

MCBench introduces a new benchmark for assessing Omni Large Language Models (LLMs) across 1196 scenarios in four safety categories, revealing significant challenges in cross-modal reasoning.

Quick Take

Current Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Their reasoning traces show that they can extract modality-specific information yet often fail to integrate it, highlighting the need for improved architectures and training strategies.

Key Points

MCBench features 1196 scenarios for safety assessment of Omni LLMs.
Models struggle with subtle or non-physical risks but perform better with salient visual or acoustic cues.
Scenarios span four safety categories and require integrating vision, audio, and text inputs.
Current architectures lack robust cross-modal reasoning capabilities.
Need for enhanced training strategies for multimodal safety.

Article Excerpt

From source RSS / original summary

arXiv:2606.05177v1. Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity.

Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments.

Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.
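
The abstract pairs each unsafe scenario with a minimally different safe counterpart but does not say how the pairs are scored. The sketch below is one plausible way to do it, not MCBench's actual metric: the `PairResult` type, the `paired_scores` function, and the three score names are all hypothetical. It counts a pair as solved only when the model flags the unsafe scenario and does not flag its safe twin.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Hypothetical record of model verdicts for one unsafe/safe scenario pair."""
    unsafe_flagged: bool  # True if the model judged the unsafe scenario unsafe (correct)
    safe_flagged: bool    # True if the model judged the safe counterpart unsafe (incorrect)


def paired_scores(results):
    """Return illustrative sensitivity scores over a list of PairResult items.

    Assumption: these metrics are not taken from the paper; they only show
    how a paired design separates real sensitivity from a fixed bias.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no pairs to score")
    unsafe_recall = sum(r.unsafe_flagged for r in results) / n
    safe_accuracy = sum(not r.safe_flagged for r in results) / n
    # A pair counts only if both halves are judged correctly.
    pair_accuracy = sum(r.unsafe_flagged and not r.safe_flagged for r in results) / n
    return {
        "unsafe_recall": unsafe_recall,
        "safe_accuracy": safe_accuracy,
        "pair_accuracy": pair_accuracy,
    }


# Made-up verdicts for four pairs, for illustration only.
example = [
    PairResult(unsafe_flagged=True, safe_flagged=False),
    PairResult(unsafe_flagged=True, safe_flagged=True),
    PairResult(unsafe_flagged=False, safe_flagged=False),
    PairResult(unsafe_flagged=True, safe_flagged=False),
]
scores = paired_scores(example)
```

Under this scoring, a model that labels everything unsafe reaches full unsafe recall but zero pair accuracy, which is the kind of insensitivity a minimally different counterpart is meant to expose.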
