Visual Grounding (VG) in Visual Question Answering (VQA) refers to a model's proclivity to infer answers based on question-relevant image regions. Conceptually, VG is an axiomatic requirement of the VQA task. In practice, however, DNN-based VQA models are notorious for bypassing VG through shortcut (SC) learning without suffering obvious performance losses on standard benchmarks. To uncover the impact of SC learning, Out-of-Distribution (OOD) tests have been proposed that expose a lack of VG through low accuracy. These tests have since been at the center of VG research and have served as the basis for various investigations into VG's impact on accuracy. However, the role of VG in VQA is still not fully understood and has not yet been properly formalized. In this work, we seek to clarify VG's role in VQA by formalizing it on a conceptual level. We propose a novel theoretical framework called "Visually Grounded Reasoning" (VGR) that uses the concepts of VG and Reasoning to describe VQA inference in ideal OOD testing. By consolidating fundamental insights into VG's role in VQA, VGR helps to reveal rampant VG-related SC exploitation in OOD testing, which explains why the relationship between VG and OOD accuracy has been difficult to define. Finally, we propose an approach to create OOD tests that properly emphasize a requirement for VG, and show how to improve performance on them.