Model-X approaches to testing conditional independence between a predictor and an outcome variable given a vector of covariates usually assume exact knowledge of the conditional distribution of the predictor given the covariates. Nevertheless, model-X methodologies are often deployed with this conditional distribution learned in sample. We investigate the consequences of this choice through the lens of the distilled conditional randomization test (dCRT). We find that Type-I error control is still possible, but only if the mean of the outcome variable given the covariates is estimated well enough. This demonstrates that the dCRT is doubly robust, and motivates a comparison to the generalized covariance measure (GCM) test, another doubly robust conditional independence test. We prove that these two tests are asymptotically equivalent, and show that the GCM test is optimal against (generalized) partially linear alternatives by leveraging semiparametric efficiency theory. In an extensive simulation study, we compare the dCRT to the GCM test. These two tests have broadly similar Type-I error and power, though the dCRT can have somewhat better Type-I error control but somewhat worse power in small samples or when the response is discrete. We also find that post-lasso-based test statistics (as compared to lasso-based statistics) can dramatically improve Type-I error control for both methods.
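To make the GCM test mentioned above concrete, the following is a minimal sketch, not the authors' implementation. It assumes ordinary least squares for both in-sample regressions (the abstract discusses lasso and post-lasso fits instead), and the names `ols_residuals` and `gcm_test`, as well as the simulated data, are hypothetical. The statistic is the normalized mean of the products of the two residual vectors, compared against a standard normal reference distribution.

```python
import math
import numpy as np


def ols_residuals(target, Z):
    """Residuals from a least-squares fit of `target` on Z (with intercept).

    Assumption: OLS stands in for whatever regression method is used in practice.
    """
    design = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def gcm_test(x, y, Z):
    """GCM-style test of X independent of Y given Z; returns (statistic, p-value)."""
    rx = ols_residuals(x, Z)   # residuals of the predictor given the covariates
    ry = ols_residuals(y, Z)   # residuals of the outcome given the covariates
    prod = rx * ry
    n = len(prod)
    # sqrt(n) * mean / sd of the residual products; approximately N(0, 1) under the null
    stat = math.sqrt(n) * prod.mean() / prod.std()
    # Two-sided normal p-value: 2 * (1 - Phi(|stat|)) = erfc(|stat| / sqrt(2))
    pval = math.erfc(abs(stat) / math.sqrt(2))
    return stat, pval


# Hypothetical simulated data: three covariates, linear models for X and Y.
rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 3))
x = Z @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)
y_null = Z @ np.array([0.5, 0.5, 0.0]) + rng.normal(size=n)  # Y independent of X given Z
y_alt = y_null + 0.5 * x                                     # Y depends on X given Z

stat_null, p_null = gcm_test(x, y_null, Z)
stat_alt, p_alt = gcm_test(x, y_alt, Z)
print(f"null: stat={stat_null:.2f}, p={p_null:.3f}")
print(f"alt:  stat={stat_alt:.2f}, p={p_alt:.3g}")
```

The dCRT uses the same two fitted regressions but calibrates its statistic by resampling the predictor from its estimated conditional distribution rather than by a normal approximation. That resampling step is omitted from this sketch.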