The recent explosion of interest in multimodal applications has produced a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, fundamental research questions remain: How can we quantify the interactions that are necessary to solve a multimodal task? And which multimodal models are most suitable for capturing these interactions? To answer these questions, we propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities to an output task. We term these three measures the PID statistics of a multimodal distribution (or PID for short), and introduce two new estimators for these PID statistics that scale to high-dimensional distributions. To validate PID estimation, we conduct extensive experiments both on synthetic datasets where the PID is known and on large-scale multimodal benchmarks where PID estimates are compared with human annotations. Finally, we demonstrate the usefulness of these estimators in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) providing principled approaches for model selection, and (4) three real-world case studies, conducted with domain experts in pathology, mood prediction, and robotic perception, where our framework helps to recommend strong multimodal models for each application.
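To make the notions of redundancy and synergy concrete, the following minimal sketch computes the classical mutual-information terms that a PID-style decomposition splits apart, on two toy distributions. It is not one of the estimators introduced in this work. It relies only on the standard PID consistency equations I(X1;Y) = R + U1, I(X2;Y) = R + U2, and I(X1,X2;Y) = R + U1 + U2 + S, from which I(X1,X2;Y) - I(X1;Y) - I(X2;Y) = S - R. The function names (`mutual_information`, `interaction_summary`) and the toy distributions are illustrative assumptions, and the approach only works for small discrete distributions that can be enumerated exactly.

```python
from collections import defaultdict
from math import log2


def mutual_information(joint, x_idx, y_idx):
    """Return I(X;Y) in bits for a discrete joint distribution.

    `joint` maps outcome tuples to probabilities; `x_idx` and `y_idx`
    select which coordinates of each outcome form X and Y.
    """
    pxy = defaultdict(float)
    px = defaultdict(float)
    py = defaultdict(float)
    for outcome, p in joint.items():
        x = tuple(outcome[i] for i in x_idx)
        y = tuple(outcome[i] for i in y_idx)
        pxy[(x, y)] += p
        px[x] += p
        py[y] += p
    return sum(
        p * log2(p / (px[x] * py[y]))
        for (x, y), p in pxy.items()
        if p > 0
    )


def interaction_summary(joint):
    """Summarize outcomes of the form (x1, x2, y).

    Returns the two single-modality terms, the joint term, and their
    difference, which equals synergy minus redundancy (S - R).
    This does NOT recover R, U1, U2, and S individually.
    """
    i1 = mutual_information(joint, (0,), (2,))
    i2 = mutual_information(joint, (1,), (2,))
    i12 = mutual_information(joint, (0, 1), (2,))
    return {"I1": i1, "I2": i2, "I12": i12, "S_minus_R": i12 - i1 - i2}


# Toy task 1 (hypothetical): y = x1 XOR x2 with independent uniform bits.
# Neither modality alone is informative, but together they determine y.
xor_task = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

# Toy task 2 (hypothetical): both modalities are copies of the label.
copy_task = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

xor_summary = interaction_summary(xor_task)
copy_summary = interaction_summary(copy_task)
print(xor_summary)   # S_minus_R is +1 bit: synergy dominates
print(copy_summary)  # S_minus_R is -1 bit: redundancy dominates
```

The sign of `S_minus_R` only indicates which of synergy or redundancy dominates. Separating the four components requires an additional definition of redundancy or uniqueness, which is the role of a full PID and of estimators that scale beyond enumerable distributions.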