Connecting Multi-modal Contrastive Representations

Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, its reliance on massive high-quality data pairs limits its extension to more modalities. This paper proposes a novel, training-efficient method, called Connecting Multi-modal Contrastive Representations (C-MCR), for learning MCR without paired data. Specifically, given two existing MCRs pre-trained on the (A, B) and (B, C) modality pairs, we project them to a new space and use data from the overlapping modality B to align the two MCRs in that space. Meanwhile, since the modality pairs (A, B) and (B, C) are already aligned within each MCR, the connection learned from the overlapping modality can also be transferred to the non-overlapping modality pair (A, C). To unleash the potential of C-MCR, we further introduce a semantic-enhanced inter- and intra-MCR connection method. We first enhance the semantic consistency and completeness of embeddings across different modalities for more robust alignment. We then use inter-MCR alignment to establish the connection and intra-MCR alignment to better maintain the connection for inputs from non-overlapping modalities. We take audio-visual contrastive learning as an example to demonstrate the effectiveness of C-MCR: we connect pre-trained CLIP and CLAP models via text to derive audio-visual contrastive representations. Remarkably, without using any paired audio-visual data or further tuning, C-MCR achieves state-of-the-art performance on six datasets across three audio-visual downstream tasks.
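
To make the connection step concrete, the following is a minimal NumPy sketch of the core idea only: embeddings of the same texts from two frozen MCRs are projected into a new space, and a contrastive loss scores how well the two projections agree. Everything specific here is an assumption for illustration, not the paper's implementation: the random arrays stand in for CLIP and CLAP outputs, the linear projectors `W_clip` and `W_clap` are hypothetical and left untrained, the symmetric InfoNCE loss and its temperature are assumed choices, and the semantic enhancement and intra-MCR alignment described above are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(u, v, temperature=0.05):
    """Symmetric InfoNCE loss; row i of u is the positive for row i of v."""
    u, v = l2_normalize(u), l2_normalize(v)
    logits = u @ v.T / temperature
    idx = np.arange(len(u))
    loss_uv = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_vu = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_uv + loss_vu)


rng = np.random.default_rng(0)
n_texts, d_clip, d_clap, d_new = 8, 16, 12, 10  # illustrative sizes only

# Stand-ins for frozen text embeddings of the SAME texts (overlapping modality B).
text_clip = rng.normal(size=(n_texts, d_clip))
text_clap = rng.normal(size=(n_texts, d_clap))

# Hypothetical linear projectors into the new shared space (untrained here).
W_clip = rng.normal(size=(d_clip, d_new)) * 0.1
W_clap = rng.normal(size=(d_clap, d_new)) * 0.1

# Inter-MCR objective on the overlapping modality: this is the quantity a
# training loop would minimize with respect to W_clip and W_clap.
loss = symmetric_info_nce(text_clip @ W_clip, text_clap @ W_clap)

# Transfer to the non-overlapping pair (A, C): project image embeddings with
# the CLIP-side projector and audio embeddings with the CLAP-side projector,
# then compare them by cosine similarity in the new space.
image_clip = rng.normal(size=(3, d_clip))  # stand-in for CLIP image embeddings
audio_clap = rng.normal(size=(5, d_clap))  # stand-in for CLAP audio embeddings
sim = l2_normalize(image_clip @ W_clip) @ l2_normalize(audio_clap @ W_clap).T
print(f"text-text loss: {loss:.3f}, image-audio similarity shape: {sim.shape}")
```

Because the projectors are shared across the modalities within each MCR, whatever alignment is learned on text is applied unchanged to images and audio, which is why no paired audio-visual data appears anywhere in the sketch.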
