Human speakers use a variety of expressions when describing the same object in an image, giving rise to a distribution of plausible labels driven by pragmatic constraints. The extent to which current Vision & Language Large Language Models (VLLMs) can mimic this crucial feature of language use remains an open question. The question applies to common, everyday objects, but it is particularly interesting for uncommon or novel objects for which a category label may be lacking or fuzzy. Similar patterns of variation are also observed among human speakers for highly context-sensitive expressions, such as the quantifiers 'few' or 'most'. In our work, we evaluate VLLMs (FROMAGe, BLIP-2, LLaVA) on three categories (nouns, attributes, and quantifiers) where humans show great subjective variability concerning the distribution over plausible labels, using datasets and resources mostly under-explored in previous work. Our results reveal mixed evidence on the ability of VLLMs to capture human naming preferences at generation time: while some models are good at mimicking human distributions for nouns and attributes, all of them fail at assigning quantifiers, a task that requires more accurate, high-level reasoning.