Approximating mutual information of high-dimensional variables using learned representations

Mutual information (MI) is a general measure of statistical dependence with widespread application across the sciences. However, estimating MI between multi-dimensional variables is challenging because the number of samples necessary to converge to an accurate estimate scales unfavorably with dimensionality. In practice, existing techniques can reliably estimate MI in up to tens of dimensions, but fail in higher dimensions, where sufficient sample sizes are infeasible. Here, we explore the idea that underlying low-dimensional structure in high-dimensional data can be exploited to faithfully approximate MI in high-dimensional settings with realistic sample sizes. We develop a method that we call latent MI (LMI) approximation, which applies a nonparametric MI estimator to low-dimensional representations learned by a simple, theoretically motivated model architecture. Using several benchmarks, we show that, unlike existing techniques, LMI can approximate MI well for variables with $> 10^3$ dimensions if their dependence structure has low intrinsic dimensionality. Finally, we showcase LMI on two open problems in biology. First, we approximate MI between protein language model (pLM) representations of interacting proteins, and find that pLMs encode non-trivial information about protein-protein interactions. Second, we quantify the cell fate information contained in single-cell RNA-seq (scRNA-seq) measurements of hematopoietic stem cells, and find a sharp transition during neutrophil differentiation at which the fate information captured by scRNA-seq increases dramatically.
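The two-step idea described above (compress each variable to a low-dimensional representation, then apply a nonparametric MI estimator to the representations) can be illustrated with a minimal sketch. It is not the LMI implementation. As labeled assumptions, PCA stands in for the learned representation, a brute-force Kraskov-Stögbauer-Grassberger (KSG) nearest-neighbor estimator stands in for the nonparametric estimator, and the synthetic data are noise-free linear embeddings of a correlated Gaussian pair, so that the true MI is known in closed form. All function names (`digamma_int`, `chebyshev`, `ksg_mi`, `pca_project`) are hypothetical.

```python
import numpy as np

def digamma_int(n):
    """Digamma at positive integers: psi(n) = H_{n-1} - Euler gamma."""
    n = np.asarray(n)
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, n.max() + 1))])
    return harmonic[n - 1] - np.euler_gamma

def chebyshev(a):
    """Pairwise max-norm distances between the rows of a (shape N x d)."""
    return np.abs(a[:, None, :] - a[None, :, :]).max(axis=-1)

def ksg_mi(x, y, k=3):
    """KSG (algorithm 1) MI estimate in nats; brute force, so small N only."""
    n = len(x)
    dx, dy = chebyshev(x), chebyshev(y)
    dz = np.maximum(dx, dy)              # max-norm distance in the joint space
    np.fill_diagonal(dz, np.inf)         # a point is not its own neighbor
    eps = np.partition(dz, k - 1, axis=1)[:, k - 1]   # distance to k-th neighbor
    nx = (dx < eps[:, None]).sum(axis=1) - 1          # strict count, minus self
    ny = (dy < eps[:, None]).sum(axis=1) - 1
    return (digamma_int(k) + digamma_int(n)
            - np.mean(digamma_int(nx + 1) + digamma_int(ny + 1)))

def pca_project(a, d):
    """Assumption: PCA replaces the learned low-dimensional representation."""
    centered = a - a.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    z = centered @ vt[:d].T
    return z / z.std(axis=0)

rng = np.random.default_rng(0)
n, dim, rho = 1000, 500, 0.8

# One-dimensional dependence structure: a correlated Gaussian pair.
zx = rng.standard_normal(n)
zy = rho * zx + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Noise-free linear embeddings into 500 dimensions preserve the MI.
X = zx[:, None] * rng.standard_normal((1, dim))
Y = zy[:, None] * rng.standard_normal((1, dim))

true_mi = -0.5 * np.log(1 - rho**2)      # closed form for a bivariate Gaussian
estimate = ksg_mi(pca_project(X, 1), pca_project(Y, 1))
print(f"true MI: {true_mi:.3f} nats, estimate on 1-D representations: {estimate:.3f} nats")
```

In this toy setting a linear projection recovers the latent variables exactly, which is why PCA suffices; data with nonlinear structure would need a learned, nonlinear representation. By the data processing inequality, the MI between deterministic representations of two variables cannot exceed the MI between the variables themselves, so a poor representation makes the approximated quantity smaller, although a finite-sample estimate of it can still err in either direction.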
