Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibit shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture the nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities, thereby capturing the shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.
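The core idea of aligning mixed samples from one modality with the corresponding samples from another can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the soft-target construction, and the use of a single mixup coefficient per batch are assumptions based on standard mixup-contrastive formulations.

```python
import numpy as np

def mixup_contrastive_loss(z_a, z_b, lam, perm, temperature=0.1):
    """Hedged sketch of a Mixup-based contrastive loss.

    z_a, z_b : (N, D) L2-normalized embeddings from two modalities.
    lam      : mixup coefficient in [0, 1].
    perm     : permutation of batch indices used to form mixed samples.

    A mixed sample from modality A is pulled toward BOTH of its source
    samples' counterparts in modality B, weighted by lam and 1 - lam,
    which is one way to encode shared (non-pairwise) relations.
    """
    # Mix modality-A embeddings with a permuted copy of themselves.
    z_mix = lam * z_a + (1.0 - lam) * z_a[perm]
    z_mix /= np.linalg.norm(z_mix, axis=1, keepdims=True)

    # Cosine-similarity logits against all modality-B embeddings.
    logits = z_mix @ z_b.T / temperature

    # Soft targets: weight lam on the original pair, 1 - lam on the mixed-in pair.
    n = z_a.shape[0]
    targets = lam * np.eye(n) + (1.0 - lam) * np.eye(n)[perm]

    # Soft-label cross-entropy, averaged over the batch (stabilized log-softmax).
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())
```

With `lam = 1.0` the loss reduces to a standard one-to-one contrastive (InfoNCE-style) objective, making the mixup term an interpolation between pairwise and shared-relation alignment.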