The recent explosion of interest in multimodal applications has produced a wide selection of datasets and methods for representing and integrating information from different signals. Despite these empirical advances, fundamental research questions remain: how can we quantify the nature of the interactions that exist among input features, and how can we then capture these interactions using suitable data-driven methods? To answer these questions, we propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy across input features, which we term the PID (partial information decomposition) statistics of a multimodal distribution. Using two newly proposed estimators that scale to high-dimensional distributions, we demonstrate their usefulness in quantifying the interactions within multimodal datasets and the nature of the interactions captured by multimodal models, and in guiding principled model selection. We conduct extensive experiments on both synthetic datasets where the PID statistics are known and large-scale multimodal benchmarks where PID estimation was previously impossible. Finally, to demonstrate the real-world applicability of our approach, we present three case studies in pathology, mood prediction, and robotic perception, where our framework accurately recommends strong multimodal models for each application.
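As a toy illustration of what redundancy, uniqueness, and synergy mean, the following sketch decomposes the information that two discrete features carry about a label. It is not one of the two estimators proposed in this work. It assumes the simple minimum-mutual-information (MMI) definition of redundancy, R = min(I(X1;Y), I(X2;Y)), and it only applies to small discrete distributions given as explicit probability tables. The names `mutual_information` and `mmi_pid` are hypothetical and chosen for this example.

```python
import math
from collections import defaultdict
from itertools import product


def mutual_information(joint, a_idx, b_idx):
    """Return I(A;B) in bits.

    joint maps outcome tuples to probabilities; a_idx and b_idx are the
    tuple positions that make up the variables A and B.
    """
    pa, pb, pab = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        pa[a] += p
        pb[b] += p
        pab[(a, b)] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in pab.items()
        if p > 0
    )


def mmi_pid(joint):
    """Decompose I(X1,X2;Y) for a joint table over (x1, x2, y).

    Assumption: redundancy is the minimum of the two single-feature
    mutual informations (MMI), a simple stand-in chosen for illustration.
    The remaining terms follow from the PID consistency equations:
      R + U1 = I(X1;Y),  R + U2 = I(X2;Y),  R + U1 + U2 + S = I(X1,X2;Y).
    """
    i1 = mutual_information(joint, (0,), (2,))
    i2 = mutual_information(joint, (1,), (2,))
    i12 = mutual_information(joint, (0, 1), (2,))
    redundancy = min(i1, i2)
    unique1 = i1 - redundancy
    unique2 = i2 - redundancy
    synergy = i12 - redundancy - unique1 - unique2
    return {
        "redundancy": redundancy,
        "unique1": unique1,
        "unique2": unique2,
        "synergy": synergy,
    }


# XOR label: neither feature alone is informative, both together are.
xor_joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

# Copied features: both features equal the label, so they overlap fully.
copy_joint = {(v, v, v): 0.5 for v in (0, 1)}

xor_pid = mmi_pid(xor_joint)
copy_pid = mmi_pid(copy_joint)
print("XOR :", xor_pid)
print("COPY:", copy_pid)
```

Under these assumptions, the XOR table puts its single bit of information entirely in the synergy term, while the copied-feature table puts it entirely in the redundancy term. Explicit tables of this kind stop being practical for high-dimensional features, which is the setting the scalable estimators mentioned above are meant to address.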