MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence
Quick Take
MMCL-Bench is a benchmark for multimodal context learning from visual rules, procedures, and evidence.
Key Points
- Focuses on task-local rules and procedures.
- Evaluates multimodal models on 102 diverse tasks.
- Finds significant gaps: even the strongest model solves fewer than one-third of tasks under strict evaluation.
Abstract:
We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL-Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric-based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one-third of tasks under strict evaluation. Diagnostic ablations and error analysis show that failures arise throughout the context-to-answer pipeline, including context anchoring, visual evidence extraction, context reasoning, and response construction. MMCL-Bench thus highlights multimodal context learning as an important unsolved capability bottleneck for current multimodal models.
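The abstract does not spell out the scoring pipeline, so the following is a minimal sketch of what strict, rubric-based scoring could look like, assuming each task is judged against several binary rubric criteria and counts as solved only if all of them pass. Every name here (RubricCriterion, TaskResult, strict_solve_rate, the category labels) is an illustrative assumption, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str
    passed: bool  # judged externally (human or model grader); assumed binary

@dataclass
class TaskResult:
    task_id: str
    category: str  # hypothetical labels, e.g. "rule_application", "procedural", "empirical"
    criteria: list[RubricCriterion]

def task_solved_strict(result: TaskResult) -> bool:
    """Strict scoring: a task counts as solved only if every rubric criterion passes."""
    return all(c.passed for c in result.criteria)

def strict_solve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved under all-or-nothing rubric scoring."""
    if not results:
        return 0.0
    return sum(task_solved_strict(r) for r in results) / len(results)

def solve_rate_by_category(results: list[TaskResult]) -> dict[str, float]:
    """Per-category strict solve rates, e.g. to compare the three task families."""
    by_cat: dict[str, list[TaskResult]] = {}
    for r in results:
        by_cat.setdefault(r.category, []).append(r)
    return {cat: strict_solve_rate(rs) for cat, rs in by_cat.items()}
```

Under this kind of all-or-nothing aggregation, a single failed criterion (say, one missed piece of visual evidence) fails the whole task, which is consistent with the abstract's finding that even the strongest model solves fewer than one-third of the 102 tasks.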
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.12703 [cs.CV] (or arXiv:2605.12703v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12703 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Fei Yin
[v1] Tue, 12 May 2026 19:57:37 UTC (4,539 KB)
— Originally published at arxiv.org