With the burgeoning volume of image-text pair data and the growing diversity of Vision-and-Language (V\&L) tasks, scholars have introduced an abundance of deep learning models in this research domain. Furthermore, in recent years, transfer learning has shown tremendous success in Computer Vision for tasks such as Image Classification and Object Detection, and in Natural Language Processing for Question Answering and Machine Translation. Inheriting the spirit of transfer learning, research in V\&L has devised multiple pretraining techniques on large-scale datasets in order to enhance the performance of downstream tasks. The aim of this article is to provide a comprehensive review of contemporary V\&L pretraining models. In particular, we categorize and delineate pretraining approaches, along with a summary of state-of-the-art vision-and-language pretrained models. Moreover, a list of training datasets and downstream tasks is supplied to further sharpen the perspective on V\&L pretraining. Lastly, we take a further step to discuss numerous directions for future research.