StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning
Quick Take
StemBind is a diagnostic benchmark for abstract visual reasoning (AVR) that exposes a gap between rule identification and answer selection in multimodal large language models (MLLMs). Across 24 frontier MLLM configurations, rule accuracy exceeds full-item accuracy in 22 cases, and a persistent binding gap remains: models often identify the rule yet fail to map it onto the correct candidate. The benchmark is designed to locate the reasoning step at which a final-answer error arises.
Key Points
- StemBind includes 2,298 curated stems and 19,533 Perception/Rule/Full (P/R/F) tasks across nine visual operations.
- Rule accuracy exceeds final answer accuracy in 22 of 24 evaluated MLLMs.
- Even when perception and rule are both answered correctly on the same stem, models still get the full item wrong 51.2% of the time.
- The primary failure occurs in the rule-to-instance mapping stage (S3).
- Neither larger models nor explicit thinking mode reliably closes the gap; thinking even lowers rule and full-item accuracy.
Article Content
arXiv:2606.00148v1
Abstract: Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal.
We introduce StemBind, a shared-stem diagnostic benchmark that probes the same visual stem with three aligned questions: Perception (what is in the image), Rule (what pattern governs it), and Full (which option completes it), so a final-answer error can be attributed to a specific sub-step on the same evidence.
StemBind contains 2,298 curated knowledge-light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated by Sternberg's four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply). Evaluating 24 frontier MLLM configurations yields four findings. (i) The R-F chasm: rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures happen after the rule is identified.
(ii) A persistent binding gap: even when P and R are both correct on the same stem, models still answer F incorrectly 51.2% of the time. (iii) The bottleneck is S3: process diagnostics and Stage-wise Stimulus Augmentation localize the dominant failure to rule-to-instance mapping. (iv) Scaling and thinking do not help: neither larger models nor explicit thinking mode reliably closes the gap, and thinking even lowers rule and full-item accuracy.
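The two headline quantities in findings (i) and (ii) can be illustrated with a minimal sketch. It assumes per-stem correctness flags for P, R, and F are available for a single model. The `StemResult` record, the function names, and the toy data are hypothetical and are not taken from the paper or its code; the abstract does not state exactly how the reported figures are aggregated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StemResult:
    """Hypothetical per-stem outcome for one model (not the paper's schema)."""
    stem_id: str
    p_correct: bool  # Perception question answered correctly
    r_correct: bool  # Rule question answered correctly
    f_correct: bool  # Full item answered correctly


def accuracy(results, field):
    """Fraction of stems on which the given flag is True."""
    return sum(getattr(r, field) for r in results) / len(results)


def rf_gap(results):
    """Rule accuracy minus full-item accuracy (positive means R exceeds F)."""
    return accuracy(results, "r_correct") - accuracy(results, "f_correct")


def binding_gap(results):
    """Share of stems with P and R both correct on which F is still wrong.

    Returns None when no stem has both P and R correct.
    """
    both = [r for r in results if r.p_correct and r.r_correct]
    if not both:
        return None
    return sum(not r.f_correct for r in both) / len(both)


# Toy data for illustration only; these are not results from the paper.
results = [
    StemResult("s1", True, True, True),
    StemResult("s2", True, True, False),
    StemResult("s3", True, False, False),
    StemResult("s4", False, True, False),
]

print(f"R-F gap: {rf_gap(results):.2f}")
print(f"Binding gap: {binding_gap(results):.2f}")
```

Conditioning on the same stem is what the shared-stem design makes possible: because P, R, and F probe identical visual evidence, an F error on a stem where P and R are both correct cannot be attributed to perception or rule induction on that stem.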
StemBind reframes AVR evaluation from final-answer ranking to locating where abstract visual reasoning breaks down, identifying rule-to-instance binding as a concrete next target for vision-grounded reasoning.