The exploration of various vision-language tasks, such as visual captioning, visual question answering, and visual commonsense reasoning, is an important area in artificial intelligence and continues to attract the research community's attention. Despite improvements in overall performance, classic challenges persist in vision-language tasks and hinder the development of this area. In recent years, the rise of pre-trained models has been driving research on vision-language tasks. Thanks to the massive scale of their training data and model parameters, pre-trained models have exhibited excellent performance on numerous downstream tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to address these classic challenges. Such methods have become mainstream in current research, attracting increasing attention and advancing rapidly. In this paper, we present a comprehensive overview of how vision-language tasks benefit from pre-trained models. First, we review several main challenges in vision-language tasks and discuss the limitations of previous solutions before the era of pre-training. Next, we summarize recent advances in incorporating pre-trained models to address these challenges. Finally, we analyze the potential risks associated with the inherent limitations of pre-trained models and discuss possible solutions, aiming to suggest directions for future research.