Global and Local Semantic Completion Learning for Vision-Language Pre-training

Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across modalities. Inspired by the success of masked language modeling (MLM) in NLP pre-training, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of these tasks is to reconstruct masked tokens from the visible context, which learns local-local alignment, i.e., associations between image patches and text tokens. However, most of them pay little attention to the global semantic features generated for the masked data, which limits the ability to align global representations with the local features of the other modality. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task complements the missing semantics of masked data and recovers global and local features through cross-modal interactions. GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, and MLTC further enhances accurate comprehension of multimodal data. Moreover, we present a flexible vision encoder that enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
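
To make the idea of semantic completion concrete, the following is a minimal sketch and not the authors' implementation. It assumes that completion is trained with an InfoNCE-style contrastive objective, in which the global feature recovered from a masked input is pulled toward the global feature of the corresponding complete input and pushed away from the other samples in the batch. The abstract does not specify the loss, so the function name `info_nce`, the `temperature` value, and the feature shapes are all hypothetical.

```python
import numpy as np


def info_nce(recovered, target, temperature=0.07):
    """Contrastive loss between recovered and complete features (assumed objective).

    recovered: (batch, dim) features computed from masked inputs.
    target:    (batch, dim) features computed from the complete inputs.
    Row i of `recovered` is treated as the positive match for row i of `target`.
    """
    # L2-normalize so that dot products are cosine similarities.
    r = recovered / np.linalg.norm(recovered, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = (r @ t.T) / temperature  # shape: (batch, batch)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Matching pairs lie on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    complete = rng.normal(size=(8, 16))                      # stand-in global features
    well_recovered = complete + 0.01 * rng.normal(size=(8, 16))
    unrelated = rng.normal(size=(8, 16))

    # A good recovery should yield a lower loss than unrelated features.
    print(info_nce(well_recovered, complete))
    print(info_nce(unrelated, complete))
```

Under the same assumption, the function could be applied to per-token features at the masked positions to sketch the local (MLTC) counterpart, with each masked token's recovered feature matched against its feature from the complete input.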
