Human speakers use a variety of expressions when describing the same object in an image, giving rise to a distribution of plausible labels driven by pragmatic constraints; the extent to which current Vision \& Language Large Language Models (VLLMs) can mimic this crucial feature of language use is an open question. This applies to common, everyday objects, but it is particularly interesting for uncommon or novel objects for which a category label may be lacking or fuzzy. Furthermore, humans show clear production preferences for highly context-sensitive expressions, such as the quantifiers `few' or `most'. In our work, we evaluate VLLMs (FROMAGe, BLIP-2, LLaVA) on three categories (nouns, attributes, and quantifiers) in which humans show great subjective variability in the distribution over plausible labels, using datasets and resources largely under-explored in previous work. Our results reveal mixed evidence on the ability of VLLMs to capture human naming preferences, with all models failing on tasks that require high-level reasoning, such as assigning quantifiers.