Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 65 types of spatial relations in English (such as under, in front of, and facing). Although the annotation format is seemingly simple, we show that the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' per-relation performance has little correlation with the number of training examples, and that the tested models are in general incapable of recognising relations concerning the orientations of objects.