Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, have shown that these models lack fine-grained understanding, such as the ability to count or to recognize verbs, attributes, and relationships. The focus of this work is the understanding of spatial relations. This problem has been tackled previously using image-text matching (e.g., the Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), both showing poor performance and a large gap relative to humans. In this work, we show qualitatively (using explainability tools) and quantitatively (using object detectors) that the models' poor object localization ("grounding") ability is a contributing factor to their poor image-text matching performance. We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses that combines evidence from grounding the noun phrases referring to objects with their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative VLMs (such as LXMERT, GPV, and MDETR) and compare and highlight their ability to reason about spatial relationships.
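To make the compositional idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a hypothetical grounding step (standing in for the outputs of a model such as MDETR or GPV) that returns a box and a confidence for each noun phrase, then combines those grounding confidences with a simple geometric compatibility score to rank a spatial clause.

```python
# Hedged sketch of compositional spatial-clause ranking.
# `Grounding` and the hard-coded boxes below are hypothetical; in practice the
# boxes and confidences would come from a grounding model, not be hand-written.

from dataclasses import dataclass


@dataclass
class Grounding:
    box: tuple          # (x1, y1, x2, y2) in image coordinates
    confidence: float   # grounding/detection confidence in [0, 1]


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def relation_score(subj_box, obj_box, relation):
    """Geometric compatibility of two boxes with a spatial relation (toy rule)."""
    sx, sy = center(subj_box)
    ox, oy = center(obj_box)
    if relation == "left of":
        return 1.0 if sx < ox else 0.0
    if relation == "right of":
        return 1.0 if sx > ox else 0.0
    if relation == "above":
        return 1.0 if sy < oy else 0.0
    if relation == "below":
        return 1.0 if sy > oy else 0.0
    return 0.0


def clause_score(subj: Grounding, obj: Grounding, relation: str) -> float:
    """Combine grounding evidence for both noun phrases with the relation score."""
    return subj.confidence * obj.confidence * relation_score(subj.box, obj.box, relation)


# Example usage with made-up groundings for "cup left of laptop".
cup = Grounding(box=(40, 120, 110, 200), confidence=0.92)
laptop = Grounding(box=(250, 90, 480, 260), confidence=0.88)
print(clause_score(cup, laptop, "left of"))   # high score: the relation holds
print(clause_score(cup, laptop, "right of"))  # zero: the relation does not hold
```

A real system would replace the hard binary relation rule with a learned or calibrated spatial model, but the factorization into grounding evidence and relation compatibility is the point of the sketch.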