MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models
Quick Answer
MCBench introduces a new benchmark for assessing Omni Large Language Models (LLMs) across 1196 scenarios in four safety categories, revealing significant challenges in cross-modal reasoning.
Quick Take
Current models struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present, which points to a need for improved architectures and training strategies.
Key Points
- MCBench features 1196 scenarios for safety assessment of Omni LLMs.
- Models struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present.
- Scenarios span four safety categories and require integrating vision, audio, and text inputs.
- Current architectures lack robust cross-modal reasoning capabilities.
- The findings point to a need for improved architectures and training strategies for multimodal safety.
Article Excerpt
From source RSS / original summary
arXiv:2606.05177v1 Announce Type: new
Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity.
Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments.
Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.
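The paired design described in the abstract (each unsafe scenario matched with a minimally different safe counterpart) can be illustrated with a small scoring sketch. This is not the paper's metric or code: the `PairResult` schema, the field names, the placeholder category labels, and the rule that a pair counts as correct only when the model flags the unsafe variant and clears the safe one are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Model verdicts for one scenario pair (hypothetical schema, not from the paper)."""
    category: str                 # placeholder label; the source does not name the four categories
    flagged_unsafe_variant: bool  # True if the model judged the unsafe scenario as unsafe
    flagged_safe_variant: bool    # True if the model judged the safe counterpart as unsafe


def pair_accuracy(results):
    """Fraction of pairs where the unsafe variant is flagged and the safe one is not."""
    if not results:
        raise ValueError("no results to score")
    correct = sum(
        1 for r in results
        if r.flagged_unsafe_variant and not r.flagged_safe_variant
    )
    return correct / len(results)


def accuracy_by_category(results):
    """Group pair results by category and score each group separately."""
    groups = {}
    for r in results:
        groups.setdefault(r.category, []).append(r)
    return {category: pair_accuracy(rs) for category, rs in groups.items()}


# Illustrative verdicts only; these are made-up values, not benchmark results.
example = [
    PairResult("category_a", True, False),   # both variants judged correctly
    PairResult("category_a", True, True),    # over-flagging: safe variant marked unsafe
    PairResult("category_b", False, False),  # missed risk: unsafe variant not flagged
    PairResult("category_b", True, False),   # both variants judged correctly
]
overall = pair_accuracy(example)
per_category = accuracy_by_category(example)
```

Under this assumed scoring rule, a model that flags every input as unsafe scores zero, because it fails the safe half of every pair. That is one way a paired benchmark can separate genuine sensitivity to risk from blanket refusal.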