Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

arXiv cs.CV·Qian Zhang, Michal Golovanevsky, Fulvio Domini, James Tompkin

6/8/2026

Quick Answer

This paper shows that Vision-Language Models (VLMs) struggle with slant-from-texture perception: their slant estimates cluster at a few anchor angles instead of varying smoothly with the stimulus.

Quick Take

Vision-Language Models (VLMs) struggle with slant-from-texture perception: across model families and scales, their predictions anchor to a few specific angles. Supervised fine-tuning helps only partially, and the models still fail to express geometric cues in a graded manner. Prior work cited by the authors found that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not.

Key Points

VLMs predict slant primarily at anchors like 0°, ±25°, and ±45°.
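Zero-shot and in-context prompting both produce this failure across multiple VLM families and model scales, with little dependence on field of view, optical slant, or surface curvature.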
Supervised fine-tuning partially addresses the issue, but anchoring biases remain.
High-level vision-language benchmarks may not require low-level geometric sensitivity.
The authors interpret anchoring as a failure at the representation-to-output language interface: geometry may be encoded internally but is not expressed in graded form.
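
The anchored-versus-graded distinction in the points above can be made concrete with a small diagnostic. The following is a minimal sketch, not the authors' evaluation code: the function names, the 0.5° tolerance, and the synthetic responses are all illustrative assumptions, and only the anchor angles come from the abstract.

```python
# Hypothetical diagnostic for "anchored" vs. "graded" slant responses.
# Only the anchor angles are taken from the abstract; everything else
# (names, tolerance, synthetic data) is an assumption for illustration.

ANCHORS = (0.0, 25.0, -25.0, 45.0, -45.0)


def anchor_fraction(preds, anchors=ANCHORS, tol=0.5):
    """Fraction of predictions within `tol` degrees of any anchor angle."""
    hits = sum(1 for p in preds if any(abs(p - a) <= tol for a in anchors))
    return hits / len(preds)


def snap_to_anchor(slant, anchors=ANCHORS):
    """Simulate an anchored responder: return the nearest anchor angle."""
    return min(anchors, key=lambda a: abs(slant - a))


# Synthetic stimuli: true slants from -60 to 60 degrees in 5-degree steps.
true_slants = [float(s) for s in range(-60, 61, 5)]

# A graded (if biased) responder: output varies smoothly with the stimulus.
graded = [0.8 * s for s in true_slants]

# An anchored responder: output collapses onto a handful of angles.
anchored = [snap_to_anchor(s) for s in true_slants]

for name, preds in (("graded", graded), ("anchored", anchored)):
    print(name,
          "anchor fraction:", round(anchor_fraction(preds), 2),
          "distinct outputs:", len(set(preds)))
```

A graded responder produces many distinct outputs with few landing on anchors, while an anchored responder produces only a few distinct outputs, nearly all on anchors. Real model outputs would need parsing from text first, which this sketch does not cover.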

Article Excerpt

From source RSS / original summary

arXiv:2606.06714v1. Abstract: Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not. Do Vision-Language Models (VLMs) exhibit similar competences?

Across multiple VLM families and model scales, zero-shot and in-context prompting both produce distinctive failures: slant is predicted at only a small set of anchors (e.g., 0°, ±25°, ±45°) with little dependence on stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the failure, but residual anchoring persists.

While success in high-level vision-language benchmarks might not require sensitivity to low-level geometric cues, we interpret anchoring as a failure at the representation-to-output language interface: Not necessarily an absence of geometric encoding, but a failure to express it in a graded form.

