This article addresses the problem of testing the conditional independence of two generic random vectors $X$ and $Y$ given a third random vector $Z$, which plays an important role in statistical and machine learning applications. We propose a new non-parametric testing procedure that avoids explicitly estimating any conditional distributions and instead requires only sampling from the two marginal conditional distributions of $X$ given $Z$ and of $Y$ given $Z$. We further propose using a generative neural network (GNN) framework to sample from these approximated marginal conditional distributions; because the GNN adapts to low-dimensional structure and smoothness underlying the data, this tends to mitigate the curse of dimensionality. Theoretically, our test statistic is shown to enjoy a doubly robust property against GNN approximation errors: it retains all desirable properties of the oracle test statistic that uses the true marginal conditional distributions, as long as the product of the two approximation errors decays to zero faster than the parametric rate. Asymptotic properties of our statistic and the consistency of a bootstrap procedure are derived under both the null hypothesis and local alternatives. Extensive numerical experiments and real data analysis illustrate the effectiveness and broad applicability of the proposed test.