MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence
Quick Take
MMCL-Bench is a benchmark for multimodal context learning from visual rules, procedures, and evidence.
Key Points
- Focuses on task-local rules and procedures.
- Evaluates multimodal models on 102 diverse tasks.
- Finds significant gaps: even the strongest model solves fewer than one-third of tasks under strict evaluation.
Abstract:
We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL-Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric-based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one-third of tasks under strict evaluation. Diagnostic ablations and error analysis show that failures arise throughout the context-to-answer pipeline, including context anchoring, visual evidence extraction, context reasoning, and response construction. MMCL-Bench thus highlights multimodal context learning as an important unsolved capability bottleneck for current multimodal models.
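The abstract does not spell out the scoring pipeline, so the following is a minimal sketch of what strict, rubric-based scoring could look like, assuming each task is judged against several binary rubric criteria and counts as solved only if all of them pass. Every name here (RubricCriterion, TaskResult, strict_solve_rate, the category labels) is an illustrative assumption, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str
    passed: bool  # judged externally (human or model grader); assumed binary

@dataclass
class TaskResult:
    task_id: str
    category: str  # hypothetical labels, e.g. "rule_application", "procedural", "empirical"
    criteria: list[RubricCriterion]

def task_solved_strict(result: TaskResult) -> bool:
    """Strict scoring: a task counts as solved only if every rubric criterion passes."""
    return all(c.passed for c in result.criteria)

def strict_solve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved under all-or-nothing rubric scoring."""
    if not results:
        return 0.0
    return sum(task_solved_strict(r) for r in results) / len(results)

def solve_rate_by_category(results: list[TaskResult]) -> dict[str, float]:
    """Per-category strict solve rates, e.g. to compare the three task families."""
    by_cat: dict[str, list[TaskResult]] = {}
    for r in results:
        by_cat.setdefault(r.category, []).append(r)
    return {cat: strict_solve_rate(rs) for cat, rs in by_cat.items()}
```

Under this kind of all-or-nothing aggregation, a single failed criterion (say, one missed piece of visual evidence) fails the whole task, which is consistent with the abstract's finding that even the strongest model solves fewer than one-third of the 102 tasks.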
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.12703 [cs.CV] (or arXiv:2605.12703v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12703 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Fei Yin
[v1] Tue, 12 May 2026 19:57:37 UTC (4,539 KB)
— Originally published at arxiv.org