Visual Grounding (VG) methods in Visual Question Answering (VQA) attempt to improve VQA performance by strengthening a model's reliance on question-relevant visual information. The presence of such relevant information in the visual input is typically assumed during both training and testing. This assumption, however, is inherently flawed when dealing with the imperfect image representations common in large-scale VQA, where the information carried by visual features frequently deviates from expected ground-truth contents. As a result, training and testing of VG methods are performed with largely inaccurate data, which obstructs proper assessment of their potential benefits. In this study, we demonstrate that current evaluation schemes for VG methods are problematic due to the flawed assumption that relevant visual information is available. Our experiments show that these methods can be much more effective when evaluation conditions are corrected. Code is provided on GitHub.