Independence testing is of fundamental importance in modern data analysis, with broad applications in variable selection, graphical models, and causal inference. When the data are high-dimensional and the potential dependence signal is sparse, independence testing becomes very challenging without distributional or structural assumptions. In this paper, we propose a general framework for independence testing that first fits a classifier to distinguish the joint distribution from the product distribution, and then tests the significance of the fitted classifier. This framework allows us to borrow strength from the most advanced classification algorithms developed by the modern machine learning community, making it applicable to high-dimensional, complex data. By combining a sample split with a fixed permutation, our test statistic has a universal, fixed Gaussian null distribution that does not depend on the underlying data distribution. Extensive simulations demonstrate the advantages of the newly proposed test compared with existing methods. We further apply the new test to a single-cell data set to test the independence between two types of single-cell sequencing measurements, whose high dimensionality and sparsity make existing methods difficult to apply.
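To make the three ingredients above concrete (a classifier separating joint from product samples, a sample split, and a fixed permutation), the following is a minimal sketch in plain Python. It is not the statistic defined in the paper: the feature map `features`, the gradient-descent logistic regression `fit_logistic`, the pairwise difference statistic in `independence_test`, and the toy Gaussian data are all assumptions chosen for illustration. The sketch fits the classifier on one half of the sample, then on the other half compares classifier scores on observed pairs against scores on pairs re-matched by a fixed rule. Because the differences come from disjoint index pairs, they are i.i.d. with mean zero under independence, so the standardized mean is approximately standard normal under the null.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def features(x, y):
    # Toy feature map (an assumption): the product term lets a linear
    # classifier pick up linear dependence between scalar x and y.
    return [1.0, x, y, x * y]


def fit_logistic(rows, labels, steps=300, lr=0.1):
    """Gradient-descent logistic regression; a stand-in for any classifier."""
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for r, t in zip(rows, labels):
            z = sum(wi * ri for wi, ri in zip(w, r))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            for j, rj in enumerate(r):
                grad[j] += (p - t) * rj / n
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


def independence_test(xs, ys):
    half = len(xs) // 2
    # Sample split: the first half trains the classifier, the second tests it.
    x_tr, y_tr = xs[:half], ys[:half]
    x_te, y_te = xs[half:], ys[half:]

    # Fixed permutation on the training half: a cyclic shift of y by one
    # position, which mimics draws from the product distribution.
    y_shift = y_tr[1:] + y_tr[:1]
    rows = [features(x, y) for x, y in zip(x_tr, y_tr)]
    rows += [features(x, y) for x, y in zip(x_tr, y_shift)]
    labels = [1.0] * half + [0.0] * half
    w = fit_logistic(rows, labels)

    def score(x, y):
        return sum(wi * fi for wi, fi in zip(w, features(x, y)))

    # Fixed re-matching on the test half: within each disjoint index pair
    # (k, k + 1), compare the observed pair with x_k matched to y_{k+1}.
    diffs = [
        score(x_te[k], y_te[k]) - score(x_te[k], y_te[k + 1])
        for k in range(0, len(x_te) - 1, 2)
    ]
    z = math.sqrt(len(diffs)) * mean(diffs) / stdev(diffs)
    p_value = 1.0 - NormalDist().cdf(z)  # one-sided: large z suggests dependence
    return z, p_value


rng = random.Random(0)
n = 800
xs = [rng.gauss(0, 1) for _ in range(n)]
ys_dep = [x + rng.gauss(0, 1) for x in xs]    # dependent on xs
ys_ind = [rng.gauss(0, 1) for _ in range(n)]  # independent of xs

z_dep, p_dep = independence_test(xs, ys_dep)
z_ind, p_ind = independence_test(xs, ys_ind)
print(f"dependent:   z = {z_dep:.2f}, p = {p_dep:.4f}")
print(f"independent: z = {z_ind:.2f}, p = {p_ind:.4f}")
```

Using only one difference per disjoint pair wastes some of the test half, but it keeps the summands independent so that a plain one-sample z-statistic applies. In a realistic high-dimensional setting, the logistic regression here would be replaced by a more flexible classifier, which is the flexibility the framework is designed to exploit.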