Humans perceive the world through multisensory integration, blending information from different modalities to adapt their behavior. Contrastive learning offers an appealing solution for multimodal self-supervised learning: by considering each modality as a different view of the same entity, it learns to align features of different modalities in a shared representation space. However, this approach is intrinsically limited, as it only captures the shared or redundant information between modalities, while multimodal interactions can arise in other ways. In this work, we introduce CoMM, a Contrastive MultiModal learning strategy that enables communication between modalities in a single multimodal space. Instead of imposing cross- or intra-modality constraints, we propose to align multimodal representations by maximizing the mutual information between augmented versions of these multimodal features. Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy. We test CoMM both in a controlled setting and in a series of real-world ones: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities; in the latter, CoMM learns complex multimodal interactions and achieves state-of-the-art results on seven multimodal benchmarks. Code is available at https://github.com/Duplums/CoMM
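To make the core idea concrete, below is a minimal sketch (not the authors' implementation; see the repository above for that) of contrasting two independently augmented versions of a fused multimodal representation with an InfoNCE-style objective. The encoder names, the concatenation-based fusion, and the additive-noise augmentation are illustrative assumptions, not CoMM's actual architecture or augmentation strategy.

```python
# Minimal sketch: maximize mutual information (via InfoNCE) between two
# augmented versions of a *multimodal* embedding, rather than imposing
# cross- or intra-modality constraints. All module/function names here
# (ToyMultimodalEncoder, info_nce, augment) are hypothetical.
import torch
import torch.nn.functional as F


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings of shape (B, D)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class ToyMultimodalEncoder(torch.nn.Module):
    """Illustrative fusion: encode each modality, then fuse by concatenation + MLP."""

    def __init__(self, dim_a: int, dim_b: int, dim_out: int = 128):
        super().__init__()
        self.enc_a = torch.nn.Linear(dim_a, dim_out)
        self.enc_b = torch.nn.Linear(dim_b, dim_out)
        self.fuse = torch.nn.Sequential(
            torch.nn.Linear(2 * dim_out, dim_out), torch.nn.ReLU(),
            torch.nn.Linear(dim_out, dim_out))

    def forward(self, xa: torch.Tensor, xb: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.enc_a(xa), self.enc_b(xb)], dim=-1))


if __name__ == "__main__":
    B, Da, Db = 32, 64, 48
    model = ToyMultimodalEncoder(Da, Db)
    xa, xb = torch.randn(B, Da), torch.randn(B, Db)
    # Placeholder augmentation: additive Gaussian noise on both modalities.
    augment = lambda x: x + 0.1 * torch.randn_like(x)
    z1 = model(augment(xa), augment(xb))  # multimodal view 1
    z2 = model(augment(xa), augment(xb))  # multimodal view 2
    loss = info_nce(z1, z2)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

Note the contrast with standard multimodal contrastive learning (e.g., CLIP-style training): there, the loss aligns one modality's embedding against the other's, which only rewards redundant information; here, both views are already fused multimodal representations, so unique and synergistic information can also contribute to the objective.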