Visual Question Answering (VQA) is a popular task that combines vision and language, with numerous implementations in the literature. Although some work addresses explainability and robustness in VQA models, very little of it uses counterfactuals to probe these issues in a model-agnostic way. In this work, we propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations. To this end, we exploit structured knowledge bases to perform deterministic, optimal, and controllable word-level replacements targeting the linguistic modality, and then evaluate the model's responses to these counterfactual inputs. We then qualitatively extract local and global explanations from the counterfactual responses, which prove insightful for interpreting VQA model behavior. By applying a variety of perturbation types that target different parts of speech in the input question, we gain insight into the model's reasoning by comparing its responses under different adversarial conditions. Overall, our analysis reveals possible biases in the model's decision-making process, as well as expected and unexpected patterns that affect its performance both quantitatively and qualitatively.
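To make the perturb-then-compare idea concrete, the following is a minimal sketch and not the method's actual implementation. It assumes a hypothetical toy knowledge base (`TOY_KB`) whose entries and distances are invented for illustration, interprets "optimal" as "smallest knowledge-base distance" (an assumption), and uses a stub in place of a real VQA model. The names `best_substitute`, `counterfactuals`, `flip_rate`, and `stub_model` are likewise illustrative.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical toy knowledge base: word -> [(substitute, distance)].
# Entries and distances are invented; a real structured knowledge base
# would supply them.
TOY_KB: Dict[str, List[Tuple[str, int]]] = {
    "dog": [("cat", 2), ("wolf", 1), ("car", 8)],
    "red": [("blue", 1), ("crimson", 1)],
    "sitting": [("standing", 1), ("running", 2)],
}


def best_substitute(word: str) -> Optional[str]:
    """Return the closest substitute for `word`, or None if it is not in the KB.

    Deterministic: candidates are ranked by distance, ties broken alphabetically.
    """
    candidates = TOY_KB.get(word.lower())
    if not candidates:
        return None
    return min(candidates, key=lambda c: (c[1], c[0]))[0]


def counterfactuals(question: str) -> List[str]:
    """Build one counterfactual question per replaceable word."""
    tokens = question.rstrip("?").split()
    variants = []
    for i, token in enumerate(tokens):
        substitute = best_substitute(token)
        if substitute is not None:
            edited = tokens[:i] + [substitute] + tokens[i + 1:]
            variants.append(" ".join(edited) + "?")
    return variants


def flip_rate(model: Callable[[Any, str], str], image: Any, question: str) -> float:
    """Fraction of counterfactual questions whose answer differs from the original."""
    original_answer = model(image, question)
    variants = counterfactuals(question)
    if not variants:
        return 0.0
    flips = sum(model(image, q) != original_answer for q in variants)
    return flips / len(variants)


def stub_model(image: Any, question: str) -> str:
    """Stand-in for a VQA model (assumption): answers 'yes' iff 'dog' is mentioned."""
    return "yes" if "dog" in question.lower() else "no"


if __name__ == "__main__":
    q = "Is the red dog sitting?"
    for variant in counterfactuals(q):
        print(variant, "->", stub_model(None, variant))
    print("flip rate:", flip_rate(stub_model, None, q))
```

Because the model is only called as a black box on (image, question) pairs, the same loop applies to any VQA model, which is what makes this style of probing model-agnostic. Restricting the replaceable words to a single part of speech would give the per-category comparison described in the abstract.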