Model interpretability has long been a hard problem for the AI community, especially in the multimodal setting, where vision and language must be aligned and reasoned over jointly. In this paper, we focus on Visual Question Answering (VQA). While previous work probes the network structures of black-box multimodal models, we tackle the problem from a different angle: we treat interpretability as an explicit additional goal. Given an image and a question, we argue that an interpretable VQA model should be able to state which conclusions it draws from which parts of the image, and show how each statement helps it arrive at an answer. We introduce InterVQA: Interpretable-by-design VQA, in which we design an explicit, dynamic intermediate reasoning structure for VQA problems and require that the final answer be predicted by symbolic reasoning that uses only this structure. InterVQA produces high-quality explicit intermediate reasoning steps while maintaining end-task performance close to the state of the art (SOTA).