We present a novel multimodal interpretable visual question answering (VQA) model that answers questions more accurately and generates more diverse explanations. Although researchers have proposed several methods that generate human-readable, fine-grained natural language sentences to explain a model's decision, these methods rely solely on the information in the image. Ideally, a model should draw on information both inside and outside the image to generate correct explanations, just as people use background knowledge in daily life. The proposed method incorporates outside knowledge and multiple image captions to increase the diversity of information available to the model. The contribution of this paper is an interpretable VQA model that uses multimodal inputs to improve the rationality of the generated results. Experimental results show that our model outperforms state-of-the-art methods in answer accuracy and explanation rationality.