The exploration of various vision-language tasks, such as visual captioning, visual question answering, and visual commonsense reasoning, is an important area in artificial intelligence and continues to attract the research community's attention. Despite improvements in overall performance, classic challenges persist in vision-language tasks and hinder the development of this area. In recent years, the rise of pre-trained models has been driving research on vision-language tasks. Thanks to the massive scale of their training data and model parameters, pre-trained models have exhibited excellent performance on numerous downstream tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to address these classic challenges. Such methods have become mainstream in current research, attracting increasing attention and advancing rapidly. In this paper, we present a comprehensive overview of how vision-language tasks benefit from pre-trained models. First, we review several main challenges in vision-language tasks and discuss the limitations of previous solutions before the era of pre-training. Next, we summarize recent advances in incorporating pre-trained models to address these challenges. Finally, we analyze the potential risks associated with the inherent limitations of pre-trained models and discuss possible solutions, aiming to suggest directions for future research.