Visual question answering (VQA) has recently seen a surge of research on explaining predicted answers. However, current systems mostly rely on separate models to predict answers and to generate explanations, which leads to results that are less grounded and frequently inconsistent. To address this, we propose a multitask learning approach towards a Unified Model for Answer and Explanation generation (UMAE). Our approach adds artificial prompt tokens to the training data and fine-tunes a multimodal encoder-decoder model on a variety of VQA-related tasks. In our experiments, UMAE models surpass the prior state-of-the-art answer accuracy on A-OKVQA by 10–15%, show competitive results on OK-VQA, achieve new state-of-the-art explanation scores on A-OKVQA and VCR, and demonstrate promising out-of-domain performance on VQA-X.
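As a rough illustration of what adding artificial prompt tokens to training data can look like, the sketch below builds (source, target) text pairs for a text-to-text model, with a task token prepended to each source. The token strings, the `VQAExample` fields, the `build_pair` helper, and the "because" template are all assumptions made for this example; the abstract does not specify them, and image inputs are omitted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical task tokens; the abstract does not list the actual strings.
TASK_TOKENS = {
    "answer": "<#ANS#>",
    "explanation": "<#EXP#>",
    "answer_explanation": "<#ANS_EXP#>",
}


@dataclass
class VQAExample:
    """Text side of one training example (image features are left out)."""
    question: str
    answer: str
    explanation: Optional[str] = None


def build_pair(example: VQAExample, task: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for one task.

    The task token at the start of the source tells a single shared model
    which kind of output is expected for this example.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    token = TASK_TOKENS[task]

    if task == "answer":
        return f"{token} {example.question}", example.answer

    # The remaining tasks need a reference explanation.
    if example.explanation is None:
        raise ValueError(f"task '{task}' requires an explanation")

    if task == "explanation":
        # Explain a given answer (assumed input template).
        source = f"{token} {example.question} answer: {example.answer}"
        return source, example.explanation

    # Joint task: generate the answer and its explanation in one sequence.
    target = f"{example.answer} because {example.explanation}"
    return f"{token} {example.question}", target


if __name__ == "__main__":
    ex = VQAExample("What is the man holding?", "umbrella", "it is raining")
    for task in TASK_TOKENS:
        print(build_pair(ex, task))
```

Mixing pairs from all three tasks in one fine-tuning set is one plausible way to train a single model on both outputs. In practice the new tokens would also need to be registered with the model's tokenizer so they are not split into sub-word pieces.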