Providing explanations for visual question answering (VQA) has attracted considerable research attention. However, most existing systems use separate models for predicting answers and generating explanations. We argue that training explanation models independently of the QA model makes the explanations less grounded and limits performance. To address this, we propose a multitask learning approach towards a Unified Model for more grounded and consistent generation of both Answers and Explanations (UMAE). Specifically, we add artificial prompt tokens to training instances and fine-tune a multimodal encoder-decoder model on various VQA tasks. In our experiments, UMAE models surpass the prior state-of-the-art (SOTA) answer accuracy on A-OKVQA by 10 to 15%, show competitive results on OK-VQA, achieve new SOTA explanation scores on A-OKVQA and VCR, and demonstrate promising out-of-domain performance on VQA-X.
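The prompt-token idea can be illustrated with a minimal sketch of how training instances for several tasks might be prepared for a single encoder-decoder model. The token strings, the `Instance` type, the `build_instances` helper, and the "answer because explanation" target format are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical prompt tokens: the real token strings are not given in the abstract.
DATASET_TOKENS = {"A-OKVQA": "<aokvqa>", "OK-VQA": "<okvqa>", "VCR": "<vcr>"}
ANSWER_TOKEN = "<answer>"
EXPLAIN_TOKEN = "<explain>"


@dataclass
class Instance:
    source: str  # text given to the encoder alongside the image features
    target: str  # text the decoder is trained to generate


def build_instances(dataset: str, question: str, answer: str,
                    explanation: Optional[str] = None) -> List[Instance]:
    """Turn one annotated example into one or two prompt-prefixed instances."""
    prefix = DATASET_TOKENS[dataset]
    # Answer-only task: the prompt tokens tell the model which output is wanted.
    instances = [Instance(f"{prefix} {ANSWER_TOKEN} {question}", answer)]
    if explanation is not None:
        # Answer-plus-explanation task (assumed target format).
        instances.append(
            Instance(f"{prefix} {EXPLAIN_TOKEN} {question}",
                     f"{answer} because {explanation}")
        )
    return instances


if __name__ == "__main__":
    for inst in build_instances("A-OKVQA", "What is the man holding?",
                                "umbrella", "it is raining"):
        print(inst.source, "->", inst.target)
```

Mixing instances built this way from several datasets into one training set is what makes the setup multitask: the same model sees every task and relies on the prefix tokens to decide what to generate.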