Visual question answering (VQA) requires a system to give an accurate natural language answer to a natural language question about an image. However, it is widely recognized that earlier generic VQA methods tend to memorize biases in the training data rather than learn the intended behavior, such as grounding on the image before predicting an answer. As a result, these methods usually achieve high in-distribution performance but poor out-of-distribution performance. In recent years, various datasets and debiasing methods have been proposed to evaluate and enhance VQA robustness, respectively. This paper provides the first comprehensive survey of this emerging research direction. First, we give an overview of how the datasets have developed, from both in-distribution and out-of-distribution perspectives. Second, we examine the evaluation metrics these datasets employ. Third, we propose a typology of existing debiasing methods that presents their development, similarities and differences, robustness comparison, and technical features. Fourth, we analyze and discuss the robustness of representative vision-and-language pre-training models on VQA. Finally, drawing on a thorough review of the available literature and on experimental analysis, we discuss key areas for future research from several viewpoints.