Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception
Quick Answer
Vision-Language Models (VLMs) fail at slant-from-texture perception: instead of producing graded estimates, they predict slant at only a small set of anchor angles.
Quick Take
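Across multiple model families and scales, VLMs predict surface slant at a few anchor angles, with little dependence on the stimulus. Supervised fine-tuning partially remediates the failure, but residual anchoring persists. The authors read this as a failure to express geometric cues in graded form, not necessarily an absence of geometric encoding. For comparison, prior work found that unsupervised CNNs reproduce several human-like slant biases, while supervised CNNs do not.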
Key Points
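- VLMs predict slant at only a small set of anchors (e.g., 0°, ±25°, and ±45°), with little dependence on field of view, optical slant, or surface curvature.
- Zero-shot and in-context prompting both produce this failure across multiple VLM families and model scales.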
- Supervised fine-tuning partially addresses the issue, but anchoring biases remain.
- High-level vision-language benchmarks may not require low-level geometric sensitivity.
- The authors interpret anchoring as a failure at the representation-to-output language interface, not necessarily an absence of geometric encoding.
Article Excerpt
From the source RSS summary (arXiv:2606.06714v1). Abstract: Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not. Do Vision-Language Models (VLMs) exhibit similar competences?
Across multiple VLM families and model scales, zero-shot and in-context prompting both produce distinctive failures: slant is predicted at only a small set of anchors (e.g., 0°, ±25°, ±45°) with little dependence on stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the failure, but residual anchoring persists.
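To make "anchored" versus "graded" concrete, here is a minimal Python sketch of how one might summarize a set of slant predictions. It is an illustration under stated assumptions, not the paper's evaluation protocol. The function name `anchoring_stats`, the tolerance `tol`, and the three summary numbers are hypothetical choices made for this example; only the anchor angles come from the abstract.

```python
# Hypothetical summary of anchoring in slant predictions (not the paper's metric).
# Anchor angles are taken from the abstract; everything else is an assumption.
ANCHORS = (-45.0, -25.0, 0.0, 25.0, 45.0)


def anchoring_stats(true_slants, predicted_slants, anchors=ANCHORS, tol=0.5):
    """Return simple indicators of anchored vs. graded slant predictions.

    anchor_fraction:      share of predictions within `tol` degrees of an anchor
    distinct_predictions: number of distinct predicted values (rounded to 0.1 deg)
    slope:                least-squares slope of predicted slant on true slant
    """
    if len(true_slants) != len(predicted_slants) or not true_slants:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(predicted_slants)

    # Count predictions that land on (or within tol of) any anchor angle.
    at_anchor = sum(
        1 for p in predicted_slants if any(abs(p - a) <= tol for a in anchors)
    )

    # A graded response should produce many distinct values; an anchored one, few.
    distinct = len({round(p, 1) for p in predicted_slants})

    # Ordinary least-squares slope: how strongly predictions track the stimulus.
    mean_t = sum(true_slants) / n
    mean_p = sum(predicted_slants) / n
    var_t = sum((t - mean_t) ** 2 for t in true_slants)
    cov = sum(
        (t - mean_t) * (p - mean_p)
        for t, p in zip(true_slants, predicted_slants)
    )
    slope = cov / var_t if var_t > 0 else float("nan")

    return {
        "anchor_fraction": at_anchor / n,
        "distinct_predictions": distinct,
        "slope": slope,
    }
```

The three numbers are meant to be read together. Predictions snapped to the nearest anchor can still show a slope near 1, so the slope alone does not reveal anchoring. A high `anchor_fraction` combined with few distinct values is the clearer signature of the failure described here.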
While success in high-level vision-language benchmarks might not require sensitivity to low-level geometric cues, we interpret anchoring as a failure at the representation-to-output language interface: not necessarily an absence of geometric encoding, but a failure to express it in a graded form.