Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms
Quick Answer
This paper shows that vision-language models (VLMs) reproduce several human behavioral signatures in visual search tasks, with frontier models holding accuracy where mid-tier models collapse to chance.
Quick Take
Using reasoning tokens as a stand-in for reaction time, the study finds that feature search costs flat effort while conjunction search effort climbs with set size. The models also diverge from humans: the target-present effort slope exceeds the target-absent slope, reversing the human ordering, and enumeration stays accurate where humans would lose count. The author argues that psychophysical paradigms are an effective probe of machine visual cognition.
Key Points
- Frontier VLMs maintain accuracy while mid-tier models collapse to chance performance.
- Feature search shows flat effort, while conjunction search effort increases with set size.
- The target-present effort slope exceeds the target-absent slope in VLMs, reversing the human ordering.
- Enumeration remains accurate in VLMs where humans would lose count.
- Psychophysical paradigms serve as effective probes for machine visual cognition.
Abstract: Visual search has been one of the most productive paradigms in the study of visual attention: the way reaction time scales with the number of items distinguishes parallel, "pop-out" search from serial, attention-demanding search. I ask whether vision-language models (VLMs) exhibit the same behavioral signatures. I adapt four classic paradigms: feature versus conjunction search, spatial-configuration (T-vs-L) search, enumeration, and the tilted/vertical search asymmetry; and present them to current frontier and mid-tier models. Because a single model call has no reaction time, I use the number of reasoning ("thinking") tokens a model spends per trial as a within-model analog of search effort, and I compare against a large public human benchmark (Wolfe et al., 2010). The models reproduce several human signatures: feature search costs flat effort while conjunction effort climbs with set size; frontier models hold accuracy where mid-tier models collapse to chance; and a resolution control shows the conjunction cost is genuine search rather than difficulty resolving small shapes. They also diverge from humans in informative ways. The target-present effort slope exceeds the target-absent slope, reversing the human ordering; enumeration remains accurate where humans would lose count; and a reasoning model with adaptive deliberation declines to deliberate on detection tasks altogether, so that a single search expresses itself as an effort gradient in one model and as an accuracy cliff in another. I argue that psychophysical paradigms, applied behaviorally, are a sharp and inexpensive probe of machine visual cognition, and that the points of divergence are as informative as the points of agreement.
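The effort-slope idea in the abstract can be illustrated with a short sketch: regress reasoning tokens per trial on set size and read the slope as "tokens per additional item". This is a minimal sketch under stated assumptions, not the paper's analysis code. The function name `effort_slope`, the choice of ordinary least squares, and every number in the example data are hypothetical and chosen only so the arithmetic is easy to follow; none of the values come from the paper.

```python
from statistics import mean


def effort_slope(set_sizes, tokens):
    """Ordinary least-squares slope of reasoning tokens on set size.

    Returns tokens per additional display item. Hypothetical helper:
    the paper's actual fitting procedure is not described in the abstract.
    """
    if len(set_sizes) != len(tokens) or len(set_sizes) < 2:
        raise ValueError("need two or more paired observations")
    mx, my = mean(set_sizes), mean(tokens)
    sxx = sum((x - mx) ** 2 for x in set_sizes)
    if sxx == 0:
        raise ValueError("set sizes must vary to estimate a slope")
    sxy = sum((x - mx) * (y - my) for x, y in zip(set_sizes, tokens))
    return sxy / sxx


# Made-up illustrative data (NOT results from the paper).
set_sizes = [4, 8, 16, 32]
feature_tokens = [200, 200, 200, 200]      # flat effort: "pop-out" pattern
conjunction_tokens = [200, 300, 500, 900]  # effort grows with set size

feature_slope = effort_slope(set_sizes, feature_tokens)
conjunction_slope = effort_slope(set_sizes, conjunction_tokens)

print(f"feature slope:     {feature_slope:.1f} tokens/item")
print(f"conjunction slope: {conjunction_slope:.1f} tokens/item")
```

Under this reading, a slope near zero corresponds to the flat-effort feature pattern and a clearly positive slope to the conjunction pattern. The same function could be applied separately to target-present and target-absent trials to compare the two slopes.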
| Subjects: | Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2606.25066 [cs.AI] (or arXiv:2606.25066v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.25066 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Farahnaz Wick
[v1] Tue, 23 Jun 2026 18:19:16 UTC (976 KB)
— Originally published at arxiv.org