Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, have shown that these models lack fine-grained understanding, such as the ability to count or to recognize verbs, attributes, and relationships. The focus of this work is the understanding of spatial relations. This problem has been tackled previously using image-text matching (e.g., the Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), both showing poor performance and a large gap relative to humans. In this work, we show qualitatively (using explainability tools) and quantitatively (using object detectors) that the models' poor object localization ("grounding") ability is a contributing factor to their poor image-text matching performance. We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses that combines evidence from grounding the noun phrases referring to objects with their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative VLMs (such as LXMERT, GPV, and MDETR) and compare and highlight their ability to reason about spatial relationships.
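To make the compositional idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a hypothetical grounding step (standing in for the outputs of a model such as MDETR or GPV) that returns a box and a confidence for each noun phrase, then combines those grounding confidences with a simple geometric compatibility score to rank a spatial clause.

```python
# Hedged sketch of compositional spatial-clause ranking.
# `Grounding` and the hard-coded boxes below are hypothetical; in practice the
# boxes and confidences would come from a grounding model, not be hand-written.

from dataclasses import dataclass


@dataclass
class Grounding:
    box: tuple          # (x1, y1, x2, y2) in image coordinates
    confidence: float   # grounding/detection confidence in [0, 1]


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def relation_score(subj_box, obj_box, relation):
    """Geometric compatibility of two boxes with a spatial relation (toy rule)."""
    sx, sy = center(subj_box)
    ox, oy = center(obj_box)
    if relation == "left of":
        return 1.0 if sx < ox else 0.0
    if relation == "right of":
        return 1.0 if sx > ox else 0.0
    if relation == "above":
        return 1.0 if sy < oy else 0.0
    if relation == "below":
        return 1.0 if sy > oy else 0.0
    return 0.0


def clause_score(subj: Grounding, obj: Grounding, relation: str) -> float:
    """Combine grounding evidence for both noun phrases with the relation score."""
    return subj.confidence * obj.confidence * relation_score(subj.box, obj.box, relation)


# Example usage with made-up groundings for "cup left of laptop".
cup = Grounding(box=(40, 120, 110, 200), confidence=0.92)
laptop = Grounding(box=(250, 90, 480, 260), confidence=0.88)
print(clause_score(cup, laptop, "left of"))   # high score: the relation holds
print(clause_score(cup, laptop, "right of"))  # zero: the relation does not hold
```

A real system would replace the hard binary relation rule with a learned or calibrated spatial model, but the factorization into grounding evidence and relation compatibility is the point of the sketch.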