Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper provides an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles (modality heterogeneity, connections, and interactions) that have driven subsequent innovations, and propose a taxonomy of six core technical challenges covering historical and recent trends: representation, alignment, reasoning, generation, transference, and quantification. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.
