With recent progress in large-scale vision and language representation learning, Vision Language Pre-training (VLP) models have achieved promising improvements on various multi-modal downstream tasks. Albeit powerful, these models have not fully leveraged world knowledge. A key challenge of knowledge-augmented VLP is the lack of clear connections between knowledge and multi-modal data. Moreover, not all knowledge present in images and texts is useful, so prior approaches often struggle to effectively integrate knowledge, visual, and textual information. In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework that addresses these issues. For the first time, we introduce a knowledge-aware self-supervised learning scheme that efficiently establishes the correspondence between knowledge and multi-modal data and identifies informative knowledge to improve the modeling of alignment and interactions between the visual and textual modalities. By adaptively integrating informative knowledge with visual and textual information, REAVL achieves new state-of-the-art performance across knowledge-based vision-language understanding and multi-modal entity linking tasks, as well as competitive results on general vision-language tasks, while using only 0.2% of the pre-training data of the best models. Our model shows strong sample efficiency and effective knowledge utilization.