Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest as over-reliance on irrelevant image parts or as complete disregard for the visual modality. Although the inference capabilities of VQA models are often illustrated with a few qualitative examples, most systems are not quantitatively assessed for their VG properties. We believe that an easily calculated criterion for meaningfully measuring a system's VG can help remedy this shortcoming and add another valuable dimension to model evaluation and analysis. To this end, we propose a new VG metric that captures whether a model (a) identifies question-relevant objects in the scene, and (b) actually relies on the information contained in those objects when producing its answer, i.e., whether its visual grounding is both "faithful" and "plausible". Our metric, called "Faithful and Plausible Visual Grounding" (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA data set is available on GitHub.