The model-X conditional randomization test is a generic framework for conditional independence testing that makes it possible to discover features conditionally associated with a response of interest while controlling the type-I error rate. An appealing advantage of this test is that it can work with any machine learning model to design powerful test statistics. Accordingly, the common practice in the model-X literature is to form a test statistic from machine learning models trained to maximize predictive accuracy, in the hope of attaining a test with good power. However, the ideal goal is to drive the model, during training, to maximize the power of the test, not merely its predictive accuracy. In this paper, we bridge this gap by introducing, for the first time, model-fitting schemes designed to explicitly improve the power of model-X tests. This is done by introducing a new cost function that aims to maximize the test statistic used to measure violations of conditional independence. Using synthetic and real data sets, we demonstrate that combining our proposed loss function with various base predictive models (lasso, elastic net, and deep neural networks) consistently increases the number of correct discoveries while keeping type-I error rates under control.
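For readers unfamiliar with the test, the following is a minimal sketch of a generic conditional randomization test, not the model-fitting scheme or loss function proposed in this paper. It assumes the conditional distribution of feature j given the other features is known and can be sampled. The names `crt_pvalue`, `ridge_abs_coef`, and `gaussian_sampler` are hypothetical, the ridge-coefficient statistic is just one illustrative choice, and the sampler assumes independent standard Gaussian features.

```python
import numpy as np

def ridge_abs_coef(X, y, j, lam=1.0):
    # Illustrative test statistic: |ridge coefficient| of feature j.
    # Any model-based importance measure could be plugged in here instead.
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return abs(beta[j])

def gaussian_sampler(X, j, rng):
    # Assumption: features are independent N(0, 1), so the conditional
    # law of X_j given the remaining features is again N(0, 1).
    return rng.standard_normal(X.shape[0])

def crt_pvalue(X, y, j, sampler, statistic, n_resamples=200, seed=None):
    # Compare the observed statistic with statistics computed after
    # replacing column j by draws from its conditional distribution.
    rng = np.random.default_rng(seed)
    t_obs = statistic(X, y, j)
    X_tilde = X.copy()
    n_at_least = 0
    for _ in range(n_resamples):
        X_tilde[:, j] = sampler(X, j, rng)
        if statistic(X_tilde, y, j) >= t_obs:
            n_at_least += 1
    # The +1 terms give a finite-sample valid p-value under the null.
    return (1 + n_at_least) / (1 + n_resamples)
```

Under these assumptions, a small p-value for feature j is evidence against conditional independence between that feature and the response. The validity of the p-value comes from the resampling step and does not depend on how the statistic is trained, which is why the choice of statistic affects power only.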