Mutual information is a general measure of statistical dependence that has found applications in representation learning, causality, domain generalization, and computational biology. However, mutual information estimators are typically evaluated on simple families of probability distributions, namely the multivariate normal distribution and selected distributions of one-dimensional random variables. In this paper, we show how to construct a diverse family of distributions with known ground-truth mutual information and propose a language-independent benchmarking platform for mutual information estimators. We discuss the general applicability and limitations of classical and neural estimators in settings involving high dimensions, sparse interactions, long-tailed distributions, and high mutual information. Finally, we provide guidelines for practitioners on how to select an estimator suited to the difficulty of the problem at hand, and on the issues to consider when applying an estimator to a new data set.
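As a rough illustration of what "known ground-truth mutual information" can mean beyond the normal case, the sketch below relies on a standard fact: mutual information is unchanged when each variable is passed through its own smooth invertible map. The abstract does not state which construction the paper uses, so this is an assumption, and the function names `gaussian_mi` and `sample_transformed` as well as the particular maps are hypothetical choices made only for this example.

```python
import numpy as np


def gaussian_mi(rho: float) -> float:
    """Closed-form mutual information (in nats) of a bivariate normal
    with unit variances and correlation rho: -0.5 * log(1 - rho^2)."""
    return float(-0.5 * np.log1p(-rho**2))


def sample_transformed(rho: float, n: int, seed: int = 0):
    """Draw n correlated normal pairs, then reshape each margin separately.

    Both maps below are smooth, strictly increasing bijections of the real
    line, so I(x_t; y_t) equals I(x; y) even though the transformed sample
    is no longer normal. The maps are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    xy = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=n)
    x, y = xy[:, 0], xy[:, 1]
    x_t = np.sinh(2.0 * x)  # stretches the tails of x
    y_t = y**3 + y          # derivative 3y^2 + 1 > 0, hence invertible
    return x, y, x_t, y_t


rho = 0.8
true_mi = gaussian_mi(rho)
x, y, x_t, y_t = sample_transformed(rho, n=5000)
print(f"ground-truth MI for rho={rho}: {true_mi:.4f} nats")
```

An estimator applied to `(x_t, y_t)` can then be scored against `true_mi`, which is the kind of comparison a benchmark of this sort depends on.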