Several recent works have shown that vision-and-language models (VLMs) pre-trained on large-scale datasets of image-text pairs lack fine-grained understanding, such as the ability to count and to recognize verbs, attributes, or relationships. This work studies the ability of these models to understand spatial relations. Previously, this ability has been evaluated using image-text matching (e.g., the Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), with models showing poor performance and a large gap to human performance in both settings. We use explainability tools to better understand the causes of this poor performance and present an alternative fine-grained, compositional approach for ranking spatial clauses. We combine the evidence from grounding the noun phrases that correspond to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative VLMs (such as LXMERT, GPV, and MDETR) and compare and highlight their abilities to reason about spatial relationships.
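As a rough illustration of what combining grounding evidence to rank spatial clauses could look like, the sketch below scores four candidate relations between two grounded noun phrases. It is not the method of this work: the `Grounding` container, the center-offset heuristic in `relation_evidence`, the four-relation vocabulary, and the product used to combine scores are all assumptions made for illustration. In practice the boxes and confidences would come from a grounding model rather than being typed in by hand.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), y grows downward


@dataclass
class Grounding:
    """Hypothetical container for one grounded noun phrase."""
    box: Box      # predicted bounding box in image coordinates
    score: float  # grounding confidence in [0, 1]


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def relation_evidence(subj_box: Box, obj_box: Box) -> Dict[str, float]:
    """Assumed heuristic: turn the center offset into normalized evidence
    for four relations of the subject with respect to the object."""
    (sx, sy), (ox, oy) = center(subj_box), center(obj_box)
    dx, dy = sx - ox, sy - oy
    raw = {
        "left of": max(-dx, 0.0),
        "right of": max(dx, 0.0),
        "above": max(-dy, 0.0),   # smaller y means higher in the image
        "below": max(dy, 0.0),
    }
    total = sum(raw.values())
    if total == 0.0:
        # Coincident centers carry no directional evidence.
        return {rel: 1.0 / len(raw) for rel in raw}
    return {rel: value / total for rel, value in raw.items()}


def rank_spatial_clauses(subj: Grounding, obj: Grounding) -> List[Tuple[str, float]]:
    """Combine grounding confidences with location evidence (here by a
    simple product, an assumption) and rank relations by the result."""
    evidence = relation_evidence(subj.box, obj.box)
    scored = [(rel, subj.score * obj.score * ev) for rel, ev in evidence.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up groundings for "cat" and "table"; not data from the paper.
    cat = Grounding(box=(10.0, 40.0, 50.0, 80.0), score=0.9)
    table = Grounding(box=(100.0, 50.0, 180.0, 90.0), score=0.8)
    for relation, score in rank_spatial_clauses(cat, table):
        print(f"cat {relation} table: {score:.3f}")
```

Because the final score is a product, a clause ranks highly only when both noun phrases are grounded confidently and their locations support the relation, which is one simple way to make the ranking compositional.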