The recent explosion of interest in multimodal applications has resulted in a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, fundamental research questions remain: How can we quantify the interactions that are necessary to solve a multimodal task? And which multimodal models are best suited to capture these interactions? To answer these questions, we propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task. We term these three measures the PID statistics of a multimodal distribution (or PID for short), and introduce two new estimators for these PID statistics that scale to high-dimensional distributions. To validate PID estimation, we conduct extensive experiments both on synthetic datasets where the PID is known and on large-scale multimodal benchmarks where PID estimates are compared with human annotations. Finally, we demonstrate the usefulness of the framework in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) supporting principled model selection, and (4) three real-world case studies with domain experts in pathology, mood prediction, and robotic perception, where our framework helps to recommend strong multimodal models for each application.
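As a toy illustration of the quantities involved, and not the estimators proposed in this work, the sketch below computes the mutual information terms that a redundancy/uniqueness/synergy decomposition splits up. It assumes the standard partial information decomposition identities I(X1,X2;Y) = R + U1 + U2 + S, I(X1;Y) = R + U1, and I(X2;Y) = R + U2 with nonnegative terms. The helper `mutual_information` and the two toy distributions `xor` and `copy` are hypothetical names introduced only for this example.

```python
import math
from collections import defaultdict


def mutual_information(joint, x_idx, y_idx=2):
    """Return I(X;Y) in bits for a discrete joint distribution.

    `joint` maps outcome tuples (x1, x2, y) to probabilities.
    X is the tuple of coordinates listed in `x_idx`; Y is the coordinate `y_idx`.
    """
    pxy = defaultdict(float)
    px = defaultdict(float)
    py = defaultdict(float)
    for outcome, p in joint.items():
        x = tuple(outcome[i] for i in x_idx)
        y = outcome[y_idx]
        pxy[(x, y)] += p
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in pxy.items()
        if p > 0
    )


def summarize(joint):
    """Return (I(X1;Y), I(X2;Y), I(X1,X2;Y)) in bits."""
    return (
        mutual_information(joint, (0,)),
        mutual_information(joint, (1,)),
        mutual_information(joint, (0, 1)),
    )


# Toy distribution 1: Y = X1 XOR X2 with independent uniform bits.
# Neither modality alone tells us anything about Y; together they determine it.
xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

# Toy distribution 2: X1 = X2 = Y, a single uniform bit copied everywhere.
# Each modality alone already determines Y.
copy = {(a, a, a): 0.5 for a in (0, 1)}

print(summarize(xor))   # (0.0, 0.0, 1.0)
print(summarize(copy))  # (1.0, 1.0, 1.0)
```

Under the assumed identities, the XOR case forces R = U1 = U2 = 0 and S = 1 bit (pure synergy), while the copy case forces U1 = U2 = S = 0 and R = 1 bit (pure redundancy). These two toy cases are pinned down by the marginal information terms alone; general distributions are not, which is why dedicated PID estimators are needed.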