VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images
Quick Take
VisAnalog is a diagnostic suite for testing visual concept transfer on natural images. Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than oracle accuracy (measured when the target image is shown directly) and degrades sharply as transformation depth increases, while human performance remains near the ceiling. The benchmark contains 617 human-validated questions covering one- to four-step transformations, and a program-conditioned evaluation identifies relation inference as the dominant bottleneck.
Key Points
- VisAnalog tests visual concept learning through transformations on natural images.
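- The benchmark includes 617 human-validated questions with one- to four-step transformations.
- End-to-end accuracy of VLMs is substantially lower than their oracle accuracy and degrades sharply with transformation depth, while human performance remains near the ceiling.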
- Relation inference is the dominant bottleneck in model performance.
- The dataset is publicly available on Hugging Face for further research.
Article Content
From source RSS / original summary: arXiv:2605.23141v1. Abstract: A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates $A\!:\!B :: C\!:\,?$: images $B$ and a hidden target image $D$ are produced by applying the same deterministic transformation sequence to source images $A$ and $C$. Given $A$, $B$, and $C$, a model must answer a multiple-choice question about $D$. The benchmark contains 617 human-validated questions spanning one- to four-step transformations such as zoom, quadrant swap, rotation, flip, and hue rotation.
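The construction described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes toy grayscale images stored as nested lists, and the helper names (`flip_horizontal`, `rotate_90_clockwise`, `swap_quadrants_diagonally`, `apply_program`, `make_analogy`) are hypothetical. The only point it shows is that one deterministic transformation sequence, applied to both $A$ and $C$, yields $B$ and the hidden target $D$.

```python
from typing import Callable, Dict, List, Sequence

Image = List[List[int]]  # toy grayscale image: rows of pixel values (an assumption)
Step = Callable[[Image], Image]


def flip_horizontal(img: Image) -> Image:
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def swap_quadrants_diagonally(img: Image) -> Image:
    """Swap top-left with bottom-right and top-right with bottom-left.

    Assumes an even number of rows and columns.
    """
    h, w = len(img) // 2, len(img[0]) // 2
    top, bottom = img[:h], img[h:]
    new_top = [row[w:] + row[:w] for row in bottom]
    new_bottom = [row[w:] + row[:w] for row in top]
    return new_top + new_bottom


def apply_program(img: Image, program: Sequence[Step]) -> Image:
    """Apply each transformation step in order."""
    for step in program:
        img = step(img)
    return img


def make_analogy(a: Image, c: Image, program: Sequence[Step]) -> Dict[str, Image]:
    """Build one A:B::C:D example; D is what a model would have to reason about."""
    return {
        "A": a,
        "B": apply_program(a, program),
        "C": c,
        "D": apply_program(c, program),
    }


# A hypothetical two-step program: flip, then rotate.
program = [flip_horizontal, rotate_90_clockwise]
example = make_analogy([[1, 2], [3, 4]], [[5, 6], [7, 8]], program)
```

In this toy setting a model would see `example["A"]`, `example["B"]`, and `example["C"]`, and would be asked a multiple-choice question whose answer depends on `example["D"]`. Longer programs correspond to the deeper transformation sequences mentioned in the abstract.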
Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than oracle accuracy when $D$ is directly shown, and degrades sharply as transformation depth increases, while human performance remains near the ceiling.
A program-conditioned evaluation further separates failures of relation inference from failures of transformation application, showing that inferring the visual relation from $A \rightarrow B$ is the dominant bottleneck, with additional application errors emerging on harder multi-step cases. The dataset is publicly available at https://huggingface.co/datasets/zli99/VisAnalog.
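One way to read the program-conditioned evaluation is as an error-attribution rule over three conditions per question. The sketch below is an assumption about how such an attribution could work, not the paper's actual protocol. The `Outcome` fields, the category names, and the toy records are all hypothetical, and the records are illustrative placeholders rather than benchmark results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Outcome:
    """Correctness of one question under three hypothetical conditions."""
    end_to_end: bool     # model sees A, B, C and must infer the relation itself
    program_given: bool  # model sees C plus the ground-truth transformation program
    oracle: bool         # model is shown D directly


def attribute(o: Outcome) -> str:
    """Assign a failed question to the earliest stage that can explain the failure."""
    if o.end_to_end:
        return "correct"
    if not o.oracle:
        # Wrong even with D visible: the question itself is the problem.
        return "question_answering"
    if not o.program_given:
        # Right with D visible, wrong when told the program: applying it failed.
        return "application"
    # Right when told the program, wrong without it: inferring A -> B failed.
    return "relation_inference"


def summarize(outcomes: Iterable[Outcome]) -> Counter:
    """Count questions per attributed category."""
    return Counter(attribute(o) for o in outcomes)


# Toy records for illustration only; these are not benchmark results.
toy = [
    Outcome(end_to_end=True, program_given=True, oracle=True),
    Outcome(end_to_end=False, program_given=True, oracle=True),
    Outcome(end_to_end=False, program_given=False, oracle=True),
    Outcome(end_to_end=False, program_given=False, oracle=False),
]
summary = summarize(toy)
```

Under this reading, the abstract's finding would correspond to `relation_inference` being the largest failure category, with `application` growing on multi-step cases.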