Despite considerable recent progress in Visual Question Answering (VQA) models, inconsistent or contradictory answers continue to cast doubt on their true reasoning capabilities. Most methods proposed to address this, however, rely on indirect strategies or on strong assumptions about pairs of questions and answers to enforce model consistency. Instead, we propose a novel strategy intended to improve model performance by directly reducing logical inconsistencies. To do this, we introduce a new consistency loss term that can be used by a wide range of VQA models and that relies on knowing the logical relation between pairs of questions and answers. While such information is typically not available in VQA datasets, we propose to infer these logical relations using a dedicated language model and to use them in our consistency loss function. We conduct extensive experiments on the VQA Introspect and DME datasets and show that our method improves state-of-the-art VQA models while remaining robust across different architectures and settings.
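To make the idea of a consistency loss over logically related question-answer pairs more concrete, the following is a minimal sketch and not the formulation used in this work, which the abstract does not specify. It assumes that the inferred relation for each pair is an implication (if the first question-answer pair holds, the second must hold too) and that the model outputs a probability for each pair. The function names, the product-based penalty, and the `weight` parameter are all hypothetical choices made for illustration.

```python
from typing import Iterable, Tuple


def implication_violation(p_premise: float, p_consequence: float) -> float:
    """Soft degree to which 'premise implies consequence' is violated.

    Assumption (illustrative, not from the paper): the violation is the
    product p_premise * (1 - p_consequence), which is 1 when the model is
    certain of the premise and certain the consequence is false, and 0
    when either the premise is rejected or the consequence is accepted.
    """
    for p in (p_premise, p_consequence):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_premise * (1.0 - p_consequence)


def consistency_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean violation over (p_premise, p_consequence) pairs; 0.0 if empty."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(implication_violation(a, b) for a, b in pairs) / len(pairs)


def total_loss(task_loss: float,
               pairs: Iterable[Tuple[float, float]],
               weight: float = 0.5) -> float:
    """Task loss plus a weighted consistency term (weight is hypothetical)."""
    return task_loss + weight * consistency_loss(pairs)


if __name__ == "__main__":
    # Two hypothetical pairs: one strongly inconsistent, one consistent.
    example_pairs = [(0.9, 0.2), (0.8, 0.95)]
    print("consistency term:", consistency_loss(example_pairs))
    print("total loss:", total_loss(1.3, example_pairs))
```

In this sketch the consistency term is simply added to the usual task loss, so it could in principle be attached to any model that exposes answer probabilities. In a real training loop the same expression would be written with differentiable tensor operations rather than Python floats.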