Conditional randomization tests (CRTs) assess whether a variable $x$ is predictive of another variable $y$, having observed covariates $z$. CRTs require fitting a large number of predictive models, which is often computationally intractable. Existing solutions to reduce the cost of CRTs typically split the dataset into training and test portions, or rely on heuristics for interactions, both of which lead to a loss in power. We propose the decoupled independence test (DIET), an algorithm that avoids both of these issues by leveraging marginal independence statistics to test conditional independence relationships. DIET tests the marginal independence of two random variables, $F(x \mid z)$ and $F(y \mid z)$, where $F(\cdot \mid z)$ is a conditional cumulative distribution function (CDF). These variables are termed "information residuals." We give sufficient conditions for DIET to achieve finite-sample type-1 error control and power greater than the type-1 error rate. We then prove that when the mutual information between the information residuals is used as the test statistic, DIET yields the most powerful conditionally valid test. Finally, we show that DIET achieves higher power than other tractable CRTs on several synthetic and real benchmarks.
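To make the two-stage idea concrete, the following is a minimal sketch rather than the procedure evaluated in this work. It assumes a homoscedastic linear model for each conditional distribution, so that $F(v \mid z)$ can be approximated by the empirical CDF of least-squares residuals. It also substitutes the absolute correlation of the (rank-valued) residuals for mutual information, and calibrates it with a plain permutation null. The function names `information_residuals` and `diet_pvalue` are hypothetical.

```python
import numpy as np


def information_residuals(v, z):
    """Approximate F(v | z), assuming v = [1, z] @ beta + noise with noise independent of z."""
    design = np.column_stack([np.ones(len(v)), z])
    beta, *_ = np.linalg.lstsq(design, v, rcond=None)
    resid = v - design @ beta
    # Empirical CDF of the residuals: ranks 1..n scaled into the open interval (0, 1).
    ranks = np.argsort(np.argsort(resid)) + 1
    return ranks / (len(v) + 1)


def diet_pvalue(x, y, z, n_perm=500, seed=0):
    """Permutation p-value for marginal independence of the two information residuals."""
    rng = np.random.default_rng(seed)
    ex = information_residuals(x, z)
    ey = information_residuals(y, z)

    def stat(a, b):
        # Assumption: absolute correlation stands in for mutual information here,
        # so only monotone dependence between the residuals is detected.
        return abs(np.corrcoef(a, b)[0, 1])

    observed = stat(ex, ey)
    null = np.array([stat(rng.permutation(ex), ey) for _ in range(n_perm)])
    # The +1 terms keep the p-value strictly positive and valid under permutation.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)


# Synthetic illustration: y_dep depends on x given z, y_ind depends on z only.
rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=(n, 2))
x = z[:, 0] + rng.normal(size=n)
y_dep = x + z[:, 1] + rng.normal(size=n)
y_ind = z[:, 0] + z[:, 1] + rng.normal(size=n)

p_dep = diet_pvalue(x, y_dep, z)
p_ind = diet_pvalue(x, y_ind, z)
```

The conditional models are fit once on the full dataset, and every permutation only recomputes a cheap marginal statistic, which is where the cost saving over refitting predictive models comes from. A more flexible conditional CDF estimator or a mutual information estimator could replace the linear fit and the correlation statistic without changing the overall structure.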