Multimodal learning has mainly focused on training large models on, and fusing feature representations from, different modalities to improve performance on downstream tasks. In this work, we depart from this trend and study the intrinsic nature of multimodal data by asking two questions: 1) can we learn more structured latent representations of general multimodal data? and 2) can we intuitively understand, both mathematically and visually, what the latent representations capture? To answer 1), we propose a general and lightweight framework, Multimodal Understanding Through Correlation Maximization and Minimization (MUCMM), that can be incorporated into any large pre-trained network. MUCMM learns both common and individual representations: the common representations capture what is shared between the modalities, and the individual representations capture what is unique to each modality. To answer 2), we propose novel scores that summarize the learned common and individual structures, and we visualize the gradients of these scores with respect to the input to reveal what the different representations capture. We further provide mathematical intuition for the computed gradients in a linear setting and demonstrate the effectiveness of our approach through a variety of experiments.
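As a rough illustration of the idea of maximizing correlation for common representations and minimizing it for individual ones, the following NumPy sketch computes two toy objectives on synthetic data. It is not the MUCMM objective, which the abstract does not specify: the function names (`cross_correlation`, `common_loss`, `individual_loss`), the choice of Pearson correlation, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def cross_correlation(a, b, eps=1e-8):
    """Pearson cross-correlation matrix between the columns of a and b.

    a, b: arrays of shape (n_samples, dim). Entry (i, j) of the result is
    the correlation between column i of a and column j of b.
    """
    a = (a - a.mean(axis=0)) / (a.std(axis=0) + eps)
    b = (b - b.mean(axis=0)) / (b.std(axis=0) + eps)
    return a.T @ b / a.shape[0]

def common_loss(c1, c2):
    # Hypothetical objective: reward high correlation between matched
    # dimensions of the two modalities' common representations.
    return -np.mean(np.diag(cross_correlation(c1, c2)))

def individual_loss(u1, u2):
    # Hypothetical objective: penalize any correlation between the two
    # modalities' individual representations.
    return np.mean(cross_correlation(u1, u2) ** 2)

# Synthetic stand-ins for learned representations (illustration only).
rng = np.random.default_rng(0)
n = 2000
shared = rng.normal(size=(n, 3))
c1 = shared + 0.1 * rng.normal(size=(n, 3))  # common part, modality 1
c2 = shared + 0.1 * rng.normal(size=(n, 3))  # common part, modality 2
u1 = rng.normal(size=(n, 3))                 # individual part, modality 1
u2 = rng.normal(size=(n, 3))                 # individual part, modality 2

loss_common = common_loss(c1, c2)
loss_individual = individual_loss(u1, u2)
```

In this toy setup the common loss is strongly negative because `c1` and `c2` share a signal, while the individual loss is near zero because `u1` and `u2` are independent. In a trained model these quantities would be computed on network outputs and minimized by gradient descent.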