Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when those concepts contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.