Extending Multi-modal Contrastive Representations

Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods achieve impressive results, their heavy dependence on large-scale, high-quality paired data and their expensive training costs limit further development. Inspired by the recent C-MCR, this paper proposes Extending Multi-modal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method that flexibly learns a unified contrastive representation space for more than three modalities by integrating the knowledge of existing MCR spaces. Specifically, Ex-MCR aligns multiple existing MCR spaces into the same base MCR space, which effectively preserves the original semantic alignment of the base MCR. In addition, we comprehensively enhance the entire learning pipeline for aligning MCR spaces from the perspectives of training data, architecture, and learning objectives. With the preserved original modality alignment and the enhanced space alignment, Ex-MCR shows superior representation learning performance and excellent modality extensibility. To demonstrate the effectiveness of Ex-MCR, we align the MCR spaces of CLAP (audio-text) and ULIP (3D-vision) into that of CLIP (vision-text), leveraging the overlapping text and image modalities, respectively. Remarkably, without using any paired data, Ex-MCR learns a unified 3D-image-text-audio contrastive representation and achieves state-of-the-art performance on audio-visual, 3D-image, audio-text, and visual-text retrieval, as well as on 3D object classification. More importantly, extensive qualitative results further demonstrate emergent semantic alignment between the extended modalities (e.g., audio and 3D), which highlights the great potential of modality extensibility.
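The core idea of aligning one MCR space into another through an overlapping modality can be illustrated with a small synthetic sketch. It is not the Ex-MCR pipeline: the data are random toy embeddings, and a closed-form least-squares linear map stands in for whatever projector and learning objectives the method actually uses. All names (`make_space`, `embed`, `fit_projector`, `top1_accuracy`, the "leaf" and "base" labels, and every dimension and noise level) are assumptions introduced for illustration. The projector is fit only on text embedded in both spaces, and audio-image pairs are used solely for evaluation.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length, as contrastive embeddings usually are."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def make_space(rng, d_latent, d_embed):
    """Hypothetical encoder: a random matrix with orthonormal rows."""
    q, _ = np.linalg.qr(rng.normal(size=(d_embed, d_latent)))
    return q.T  # shape (d_latent, d_embed)


def embed(rng, latent, space, noise=0.05):
    """Map shared 'concepts' into an embedding space with modality noise."""
    x = latent @ space
    x = x + noise * rng.normal(size=x.shape)
    return l2_normalize(x)


def fit_projector(leaf_shared, base_shared):
    """Least-squares linear map from the leaf space to the base space,
    fit only on the overlapping modality (a stand-in for a trained projector)."""
    w, *_ = np.linalg.lstsq(leaf_shared, base_shared, rcond=None)
    return w


def top1_accuracy(queries, gallery):
    """Fraction of queries whose most similar gallery row has the same index."""
    sims = queries @ gallery.T
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(len(queries))))


rng = np.random.default_rng(0)
d_latent, d_leaf, d_base = 8, 16, 12

# Toy stand-ins for an audio-text "leaf" space and a vision-text "base" space.
leaf_space = make_space(rng, d_latent, d_leaf)
base_space = make_space(rng, d_latent, d_base)

# Overlapping modality: the same texts embedded by both spaces.
text_concepts = rng.normal(size=(500, d_latent))
leaf_text = embed(rng, text_concepts, leaf_space)
base_text = embed(rng, text_concepts, base_space)

# Held-out concepts that appear as audio in the leaf space and images in the base space.
eval_concepts = rng.normal(size=(200, d_latent))
leaf_audio = embed(rng, eval_concepts, leaf_space)
base_image = embed(rng, eval_concepts, base_space)

# Align the leaf space into the base space using text only.
projector = fit_projector(leaf_text, base_text)
audio_in_base = l2_normalize(leaf_audio @ projector)

accuracy = top1_accuracy(audio_in_base, base_image)
print(f"audio-to-image top-1 accuracy after alignment: {accuracy:.3f}")
```

Because the base space is left untouched and only the leaf space is projected, the base space's own image-text alignment is unchanged in this toy setting, which mirrors the motivation for aligning into a fixed base MCR.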
