In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. Reconstructing the masked signal of one modality conditioned on the other can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, combined with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training samples. Our method also outperforms competing approaches by a significant margin in limited-data scenarios.
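To make the idea of cross-modal conditioning concrete, the following is a minimal sketch of how the two masked inputs could be prepared for one image-text pair. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the masking ratios, and the use of zeros in place of masked patches are all hypothetical choices, and the encoders and reconstruction losses are omitted.

```python
import numpy as np


def random_mask(length, ratio, rng):
    """Boolean mask with round(length * ratio) True entries (at least one)."""
    n_masked = max(1, int(round(length * ratio)))
    mask = np.zeros(length, dtype=bool)
    mask[rng.choice(length, size=n_masked, replace=False)] = True
    return mask


def make_joint_masked_inputs(patches, token_ids, mask_token_id,
                             image_ratio=0.6, text_ratio=0.3, seed=0):
    """Build the two reconstruction problems for one image-text pair.

    patches:   (num_patches, patch_dim) float array of image patches
    token_ids: (num_tokens,) integer array of text token ids
    The ratios and the zero-fill for masked patches are assumptions.
    """
    rng = np.random.default_rng(seed)
    patch_mask = random_mask(len(patches), image_ratio, rng)
    token_mask = random_mask(len(token_ids), text_ratio, rng)

    # Masked image: hidden patches are zeroed out (placeholder for a mask embedding).
    masked_patches = patches.copy()
    masked_patches[patch_mask] = 0.0

    # Masked text: hidden tokens are replaced by the mask token id.
    masked_tokens = token_ids.copy()
    masked_tokens[token_mask] = mask_token_id

    return {
        # Reconstruct masked patches, conditioned on the unmasked text.
        "mim": {"image": masked_patches, "text": token_ids,
                "targets": patches[patch_mask], "mask": patch_mask},
        # Reconstruct masked tokens, conditioned on the unmasked image.
        "mlm": {"image": patches, "text": masked_tokens,
                "targets": token_ids[token_mask], "mask": token_mask},
    }
```

In each of the two problems only one modality is corrupted, so a model trained on them has to draw on the intact modality to recover the missing signal, which is the intuition behind the implicit token-to-patch alignment described above.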