Visual Question Answering (VQA) is the challenging task of predicting the answer to a question about the content of an image. It requires a deep understanding of both the textual question and the image. Prior works evaluate answering models solely by the accuracy of the predicted answers. However, such a "black box" system disregards the reasoning behind the prediction, and we do not even know whether the predictions can be trusted. In some cases, models produce the correct answer even when they focus on irrelevant visual regions or textual tokens, which makes them unreliable and illogical. To generate both visual and textual rationales alongside the predicted answer for a given image/question pair, we propose Convincing Rationales for VQA (CRVQA). Because the new outputs require extra annotations, CRVQA is trained and evaluated on samples converted from existing VQA datasets and their visual labels. Extensive experiments demonstrate that the visual and textual rationales support the predicted answers and further improve accuracy. Furthermore, CRVQA achieves competitive performance on generic VQA datasets in the zero-shot evaluation setting. The dataset and source code will be released at https://github.com/lik1996/CRVQA2024.