Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to learn from raw multimodal data. In this survey, we provide a comprehensive review of the state of the art in SSML, in which we elucidate three major challenges intrinsic to self-supervised learning with multimodal data: (1) learning representations from multimodal data without labels, (2) fusing different modalities, and (3) learning with unaligned data. We then detail existing solutions to these challenges. Specifically, we consider (1) objectives for learning from multimodal unlabeled data via self-supervision, (2) model architectures from the perspective of different multimodal fusion strategies, and (3) pair-free learning strategies for coarse-grained and fine-grained alignment. We also review real-world applications of SSML algorithms in diverse fields such as healthcare, remote sensing, and machine translation. Finally, we discuss challenges and future directions for SSML. A collection of related resources can be found at: https://github.com/ys-zong/awesome-self-supervised-multimodal-learning.