Vision language tasks, such as answering questions about an image or generating captions that describe it, are difficult for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced by \citet{vaswani2017attention} to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining on large generic datasets and transferring what they learn to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks that require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, their limitations, and the open questions that remain.