Human speakers use a variety of expressions when describing the same object in an image, giving rise to a distribution of plausible labels driven by pragmatic constraints. The extent to which current Vision \& Language Large Language Models (VLLMs) can mimic this crucial feature of language use remains an open question. This holds for common, everyday objects, but it is particularly interesting for uncommon or novel objects for which a category label may be lacking or fuzzy. Furthermore, humans show clear production preferences for highly context-sensitive expressions, such as the quantifiers `few' or `most'. In our work, we evaluate VLLMs (FROMAGe, BLIP-2, LLaVA) on three categories (nouns, attributes, and quantifiers) where humans show great subjective variability concerning the distribution over plausible labels, using datasets and resources mostly under-explored in previous work. Our results reveal mixed evidence on the ability of VLLMs to capture human naming preferences, with all models failing in tasks that require high-level reasoning such as assigning quantifiers.