Would you still call this Dax? Novel Visual References in VLMs and Humans

arXiv cs.CV·Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

2d ago

·~1 min·6/5/2026·en·1

Quick Answer

This paper introduces the Novel Visual References Dataset (NVRD), a set of 19,176 images spanning 90 visual concepts, built to evaluate how vision-language models (VLMs) learn novel visual references.

Quick Take

The Novel Visual References Dataset (NVRD) comprises 19,176 images spanning 90 visual concepts and is designed to evaluate how vision-language models (VLMs) learn novel visual references. The findings: models struggle with in-context learning when new concepts contradict prior knowledge from pre-training, and they overgeneralize significantly compared with human judgments.

Key Points

NVRD contains 19,176 images across 90 visual concepts, each with up to 20 increasingly perturbed versions of the original object.
The evaluation covers 3 open-source and 2 closed-source models, compared against 2,400 human judgments.
Models struggle to acquire novel concepts in-context when those concepts contradict prior knowledge from pre-training.
Models and humans show correlated sensitivity to visual perturbations, but models overgeneralize significantly, extending learned labels to stimuli that humans reject.

Article Content

From source RSS / original summary

arXiv:2606.05409v1. Abstract: Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training.

To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts.

We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.
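Finding (ii) combines two observations that can look contradictory at first: model and human responses can be highly correlated across perturbation levels while the model still accepts far more perturbed stimuli than humans do. The sketch below illustrates this with made-up toy acceptance rates. The numbers, the function names `pearson` and `overgeneralization_gap`, and the gap measure itself are assumptions for illustration, not the paper's data or metrics.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def overgeneralization_gap(human, model):
    """Hypothetical measure: mean amount by which the model's acceptance
    rate exceeds the human acceptance rate at each perturbation level."""
    return sum(max(m - h, 0.0) for h, m in zip(human, model)) / len(human)

# Toy values (NOT from the paper): fraction of "yes, still the same concept"
# answers at increasing perturbation levels, index 0 = original object.
human_accept = [1.0, 0.8, 0.5, 0.2, 0.0]
model_accept = [1.0, 0.9, 0.8, 0.6, 0.5]

r = pearson(human_accept, model_accept)
gap = overgeneralization_gap(human_accept, model_accept)
print(f"correlation: {r:.3f}, overgeneralization gap: {gap:.2f}")
```

In this toy setup both curves fall as perturbation increases, so the correlation is high, yet the model's curve sits above the human one, which is what a positive gap captures.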

