Would you still call this Dax? Novel Visual References in VLMs and Humans
Quick Answer
The paper introduces the Novel Visual References Dataset (NVRD), a benchmark of 19,176 images spanning 90 visual concepts, to evaluate how vision-language models (VLMs) learn novel visual references.
Quick Take
The Novel Visual References Dataset (NVRD) provides 19,176 images spanning 90 visual concepts to evaluate how vision-language models (VLMs) learn novel references. The authors report two findings: models struggle to learn new concepts in-context when those concepts contradict prior knowledge, and models significantly overgeneralize relative to human judgments, extending learned labels to stimuli that humans reject.
Key Points
- NVRD contains 19,176 images across 90 visual concepts, each with up to 20 increasingly perturbed versions of the original object.
- The evaluation covers 3 open-source and 2 closed-source models, compared directly against 2,400 human judgments.
- Models struggle to learn novel concepts when they contradict prior knowledge.
- Models and humans show correlated sensitivity to visual perturbations, but models significantly overgeneralize, extending learned labels to stimuli that humans reject.
Article Content
From source RSS / original summary: arXiv:2606.05409v1, Announce Type: new
Abstract: Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training.
To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts.
We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.
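The abstract's second finding combines two separate quantities: how closely model and human responses track each other across perturbation levels, and how much more often models accept a perturbed stimulus than humans do. The sketch below illustrates one way such a comparison could be computed. It is not the paper's method: the per-level acceptance rates, the function names, and the gap metric are all assumptions made for illustration, and the numbers are toy values, not results from NVRD.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def overgeneralization_gap(model_rates, human_rates):
    """Mean difference in acceptance rate (model minus human).

    A positive value means the model keeps applying the learned label
    to perturbed stimuli more often than humans do. This metric is a
    hypothetical stand-in, not one taken from the paper.
    """
    return mean(m - h for m, h in zip(model_rates, human_rates))


# Toy acceptance rates (fraction of "yes, still the same concept" answers)
# at increasing perturbation levels. Illustrative values only.
human = [1.0, 0.8, 0.5, 0.2, 0.0]
model = [1.0, 0.9, 0.8, 0.6, 0.4]

r = pearson(human, model)
gap = overgeneralization_gap(model, human)
print(f"correlation: {r:.3f}, overgeneralization gap: {gap:.2f}")
```

In this toy case both curves fall as perturbation increases, so the correlation is high, yet the model's curve sits above the human one, giving a positive gap. That is how correlated sensitivity and overgeneralization can hold at the same time.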