Multimodal representation learning has shown promising improvements on various vision-language tasks. Most existing methods excel at building global-level alignment between vision and language but lack effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for recovering the masked signals. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction, but also avoids the semantic gap between high-level representations and low- or mid-level prediction targets (e.g., image pixels), thus producing semantically rich multimodal representations that perform well in both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
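To make the implicit target described above more concrete, the following is a minimal NumPy sketch, not the MAMO implementation. Everything in it is an assumption made for illustration: the helper names (`random_mask`, `toy_fusion_encoder`, `implicit_loss`, `ema_update`), the masking ratios, the use of zeros in place of a learned mask embedding, the mean-squared-error loss, and the use of an exponential-moving-average target encoder. The abstract only states that the model predicts latent multimodal representations of the unmasked input from the jointly masked input.

```python
import numpy as np


def random_mask(n, ratio, rng):
    """Return a boolean mask of length n with round(n * ratio) True entries."""
    k = int(round(n * ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:k]] = True
    return mask


def toy_fusion_encoder(x, weight):
    """Hypothetical stand-in for a multimodal fusion encoder.

    A per-position linear map plus a shared mean term, so every output
    position depends on both the image and the text inputs.
    """
    h = np.tanh(x @ weight)
    return h + h.mean(axis=0, keepdims=True)


def implicit_loss(online_out, target_out, mask):
    """Mean squared error on masked positions only (loss choice is assumed)."""
    diff = online_out[mask] - target_out[mask]
    return float(np.mean(diff ** 2))


def ema_update(target_w, online_w, momentum=0.995):
    """Exponential moving average of weights for the target encoder (assumed)."""
    return momentum * target_w + (1.0 - momentum) * online_w


rng = np.random.default_rng(0)
dim = 8
patches = rng.normal(size=(16, dim))  # 16 image patch embeddings
tokens = rng.normal(size=(8, dim))    # 8 word token embeddings

# Joint masking: both modalities are masked in the same forward pass.
img_mask = random_mask(len(patches), 0.75, rng)  # ratios are illustrative
txt_mask = random_mask(len(tokens), 0.25, rng)
mask = np.concatenate([img_mask, txt_mask])

full = np.concatenate([patches, tokens])
masked = np.where(mask[:, None], 0.0, full)  # zeros stand in for a mask embedding

online_w = rng.normal(size=(dim, dim)) * 0.1
target_w = online_w.copy()

# The target encoder sees the unmasked pair; the online encoder sees the masked pair.
target_out = toy_fusion_encoder(full, target_w)
online_out = toy_fusion_encoder(masked, online_w)

loss = implicit_loss(online_out, target_out, mask)
target_w = ema_update(target_w, online_w)
print(f"implicit loss on {mask.sum()} masked positions: {loss:.4f}")
```

The sketch covers only the implicit target. The explicit targets named in the abstract (momentum visual features of image patches and concepts of word tokens) would add further prediction heads and losses, which are omitted here because the abstract does not specify their form.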