With the flourishing of social media platforms, vision-language pre-training (VLP) has recently received great attention and achieved remarkable progress. The success of VLP largely benefits from the complementary and mutually reinforcing information across modalities. However, most recent studies focus on cross-modal contrastive learning (CMCL), which promotes image-text alignment by pulling the embeddings of positive sample pairs together while pushing those of negative pairs apart; this ignores the natural asymmetry between modalities and requires a large-scale image-text corpus to make hard-won progress. To mitigate this predicament, we propose CMAL, a Cross-Modal Associative Learning framework for VLP built on anchor point detection and cross-modal associative learning. Specifically, we first embed visual objects and textual tokens into separate hypersphere spaces to learn intra-modal hidden features, and then design a cross-modal associative prompt layer that performs anchor point masking and swapped feature filling to construct a hybrid cross-modal associative prompt. Afterwards, we exploit a unified semantic encoder to learn cross-modal interactive features for context adaptation. Finally, we design an associative mapping classification layer to learn the potential associative mappings between modalities at the anchor points, within which we develop a novel self-supervised associative mapping classification task to boost CMAL's performance. Experimental results verify the effectiveness of CMAL: it achieves performance competitive with previous CMCL-based methods on four common downstream vision-and-language tasks while using a significantly smaller corpus. In particular, CMAL obtains new state-of-the-art results on SNLI-VE and REC (testA).
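To make the core idea of the associative prompt concrete, the following is a minimal conceptual sketch (not the authors' implementation) of anchor point masking and swapped feature filling: anchors in each modality are masked and their slots filled with the paired feature from the other modality before the sequences are concatenated for the unified semantic encoder. All tensor shapes, the function name, and the anchor-pair format are illustrative assumptions.

```python
# Conceptual sketch of the hybrid cross-modal associative prompt; names and
# shapes are assumptions, not the paper's actual interface.
import torch

def build_associative_prompt(visual_feats, text_feats, anchor_pairs):
    """
    visual_feats: (num_objects, dim) object embeddings from the visual branch
    text_feats:   (num_tokens, dim)  token embeddings from the textual branch
    anchor_pairs: list of (obj_idx, tok_idx) detected anchor points, i.e.
                  object/token positions assumed to denote the same concept
    Returns the concatenated hybrid prompt fed to the unified encoder.
    """
    v_prompt = visual_feats.clone()
    t_prompt = text_feats.clone()
    for obj_idx, tok_idx in anchor_pairs:
        # Mask the anchor in each modality and fill the slot with the
        # corresponding feature from the other modality (swap filling).
        v_prompt[obj_idx] = text_feats[tok_idx]
        t_prompt[tok_idx] = visual_feats[obj_idx]
    return torch.cat([v_prompt, t_prompt], dim=0)

# Toy usage with random features and two hypothetical anchor pairs.
prompt = build_associative_prompt(torch.randn(4, 8), torch.randn(6, 8), [(0, 2), (3, 5)])
print(prompt.shape)  # torch.Size([10, 8])
```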