Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (such as under, in front of, and facing). Although the annotation format is seemingly simple, we show that the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' by-relation performance has little correlation with the number of training examples, and that the tested models are in general incapable of recognising relations concerning the orientations of objects.
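The by-relation analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the `(relation, gold, predicted)` tuple layout, the binary labels, and the toy numbers are all hypothetical and serve only to show how per-relation accuracy might be compared against training-set size.

```python
from collections import defaultdict
from math import sqrt


def by_relation_accuracy(examples):
    """Accuracy per relation from (relation, gold, predicted) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for relation, gold, predicted in examples:
        total[relation] += 1
        correct[relation] += int(gold == predicted)
    return {r: correct[r] / total[r] for r in total}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def accuracy_vs_train_size(accuracy, train_counts):
    """Correlate per-relation accuracy with per-relation training counts."""
    relations = sorted(accuracy)
    xs = [train_counts[r] for r in relations]
    ys = [accuracy[r] for r in relations]
    return pearson(xs, ys)


# Toy, made-up data (NOT results from the paper): 1 = caption true, 0 = false.
toy_predictions = [
    ("under", 1, 1), ("under", 0, 1),
    ("facing", 1, 0), ("facing", 0, 0),
    ("in front of", 1, 1), ("in front of", 0, 0),
]
toy_train_counts = {"under": 100, "facing": 200, "in front of": 300}

toy_accuracy = by_relation_accuracy(toy_predictions)
toy_r = accuracy_vs_train_size(toy_accuracy, toy_train_counts)
```

A coefficient near zero on real data would correspond to the "little correlation" reported above. The toy numbers here are only for checking that the code runs and say nothing about the paper's findings.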