With recent progress in large-scale vision and language representation learning, Vision-Language Pre-training (VLP) models have achieved promising improvements on various multi-modal downstream tasks. Although powerful, these pre-trained models still do not take advantage of world knowledge, which is only implicit in multi-modal data yet carries abundant and complementary information. In this work, we propose a REtrieval-based knowledge Augmented Vision Language Pre-training model (REAVL), which retrieves world knowledge from knowledge graphs (KGs) and incorporates it into vision-language pre-training. REAVL has two core components: a knowledge retriever that retrieves knowledge given multi-modal data, and a knowledge-augmented model that fuses multi-modal data and knowledge. By unifying four knowledge-aware self-supervised tasks, REAVL promotes the mutual integration of multi-modal data and knowledge, fusing explicit knowledge with vision-language pairs for masked multi-modal data modeling and KG relational reasoning. Experiments show that REAVL achieves new state-of-the-art performance across knowledge-based vision-language understanding and multi-modal entity linking tasks, and competitive results on general vision-language tasks, while using only 0.2% of the pre-training data of the best models.