The recent explosion of interest in multimodal applications has produced a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, fundamental research questions remain: How can we quantify the interactions that are necessary to solve a multimodal task? And which multimodal models are best suited to capture these interactions? To answer these questions, we propose an information-theoretic approach that quantifies the degree of redundancy, uniqueness, and synergy relating input modalities to an output task. We term these three measures the PID (partial information decomposition) statistics of a multimodal distribution, or PID for short, and introduce two new estimators for them that scale to high-dimensional distributions. To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and large-scale multimodal benchmarks where PID estimates are compared with human annotations. Finally, we demonstrate the usefulness of these estimates in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) selecting models in a principled way, and (4) three real-world case studies, conducted with domain experts in pathology, mood prediction, and robotic perception, where our framework helps recommend strong multimodal models for each application.
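To make redundancy, uniqueness, and synergy concrete, the sketch below computes the three mutual-information terms that a PID must account for on two toy distributions. It is not one of the estimators introduced here. It assumes the standard PID consistency equations, I(X1;Y) = R + U1, I(X2;Y) = R + U2, and I(X1,X2;Y) = R + U1 + U2 + S, with all four terms nonnegative. The function names `mutual_information` and `task_informations` are hypothetical, chosen for illustration only.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits for a 2-D joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), shape (n, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), shape (1, m)
    mask = joint > 0                        # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

def task_informations(p):
    """Given p[x1, x2, y], return I(X1;Y), I(X2;Y), I(X1,X2;Y) in bits."""
    p = np.asarray(p, dtype=float)
    n1, n2, ny = p.shape
    i1 = mutual_information(p.sum(axis=1))             # marginalize out X2
    i2 = mutual_information(p.sum(axis=0))             # marginalize out X1
    i12 = mutual_information(p.reshape(n1 * n2, ny))   # treat (X1, X2) jointly
    return i1, i2, i12

# Toy case 1: Y = X1 XOR X2 with uniform, independent binary inputs.
xor = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        xor[a, b, a ^ b] = 0.25

# Toy case 2: X1 = X2 = Y, a uniform binary variable copied into both inputs.
copy = np.zeros((2, 2, 2))
copy[0, 0, 0] = 0.5
copy[1, 1, 1] = 0.5

xor_info = task_informations(xor)    # (0, 0, 1) bits
copy_info = task_informations(copy)  # (1, 1, 1) bits
```

Under the assumed equations, these two cases are fully determined. For XOR, neither input alone is informative, so R = U1 = U2 = 0 and the whole bit is synergy (S = 1). For the copy case, each input alone already carries the full bit, so R = 1 and U1 = U2 = S = 0. In general the three mutual-information terms do not pin down four unknowns, which is why a PID needs an additional definition and dedicated estimators.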