Self-Supervised Multimodal Learning: A Survey

Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, its heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to leverage supervision from raw multimodal data. In this survey, we provide a comprehensive review of the state of the art in SSML, which we categorize along three orthogonal axes: objective functions, data alignment, and model architectures. These axes correspond to the inherent characteristics of self-supervised learning methods and multimodal data. Specifically, we classify training objectives into instance discrimination, clustering, and masked prediction categories. We also discuss multimodal input data pairing and alignment strategies during training. We then review model architectures, including the design of encoders, fusion modules, and decoders, which are essential components of SSML methods. We review downstream multimodal application tasks, reporting the concrete performance of state-of-the-art image-text models and multimodal video models, and also review real-world applications of SSML algorithms in diverse fields such as healthcare, remote sensing, and machine translation. Finally, we discuss challenges and future directions for SSML. A collection of related resources can be found at: https://github.com/ys-zong/awesome-self-supervised-multimodal-learning.
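As a rough illustration of the instance discrimination category named above, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired embeddings from two modalities. It is not taken from the survey: the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration, and real SSML methods would compute the embeddings with learned encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Hypothetical helper: InfoNCE-style loss for paired embeddings.

    emb_a[i] and emb_b[i] are assumed to come from the same instance
    (a positive pair); every other combination acts as a negative.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity between every a-row and every b-row, scaled by temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy embeddings (assumed values, standing in for image and text encoders).
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_emb_swapped = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = symmetric_contrastive_loss(image_emb, text_emb_aligned)
swapped_loss = symmetric_contrastive_loss(image_emb, text_emb_swapped)
```

In this toy setting the loss is lower when each image embedding lies closest to its own paired text embedding than when the pairs are swapped, which is the behavior an instance discrimination objective is meant to reward.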
