The variable selection problem is to discover which of a large set of predictors are associated with an outcome of interest, conditionally on the other predictors. This problem has been widely studied, but existing approaches lack at least one of the following: power against complex alternatives, robustness to model misspecification, computational efficiency, or quantification of evidence against individual hypotheses. We present tower PCM (tPCM), a statistically and computationally efficient solution to the variable selection problem that does not suffer from these shortcomings. tPCM combines the best aspects of two existing procedures that are based on similar functionals: the holdout randomization test (HRT) and the projected covariance measure (PCM). The former is a model-X test that uses many resamples and few machine learning fits, while the latter is an asymptotic, doubly robust-style test for a single hypothesis that requires no resamples but many machine learning fits. Theoretically, we establish the validity of tPCM and, perhaps surprisingly, the asymptotic equivalence of HRT, PCM, and tPCM. In doing so, we clarify the relationship between two methods from separate literatures. An extensive simulation study verifies that tPCM can yield substantial computational savings compared to HRT and PCM while maintaining nearly identical power.
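To make the "many resamples and few machine learning fits" description of the HRT concrete, the following is a minimal sketch of a holdout randomization test for one predictor. It is an illustration under stated assumptions, not the procedure or code of this work: the predictors are assumed to be independent standard normals (so the conditional law of each predictor given the others is known exactly), an ordinary least squares fit stands in for the machine learning model, and the names `hrt_pvalue`, `predict`, and `sample_xj` are hypothetical. It does not implement PCM or tPCM.

```python
import numpy as np


def hrt_pvalue(predict, X_test, y_test, j, sample_xj, n_resamples=200, rng=None):
    """Holdout randomization p-value for predictor j (illustrative sketch).

    predict   : fitted model, maps an (n, p) array to n predictions
    sample_xj : hypothetical helper drawing X_j given the other columns
    """
    rng = np.random.default_rng() if rng is None else rng
    # Test statistic: held-out mean squared error with the observed X_j.
    observed = np.mean((y_test - predict(X_test)) ** 2)
    X_tilde = X_test.copy()
    count = 0
    for _ in range(n_resamples):
        # Replace column j with a draw from its conditional law; no refit needed.
        X_tilde[:, j] = sample_xj(X_test, j, rng)
        resampled = np.mean((y_test - predict(X_tilde)) ** 2)
        count += int(resampled <= observed)
    # Finite-sample valid p-value when the conditional law is exact.
    return (1 + count) / (1 + n_resamples)


# Synthetic data (assumption): only predictor 0 affects the outcome.
rng = np.random.default_rng(0)
n, p = 400, 5
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)

# Single model fit on the training half (least squares as a stand-in).
X_train, y_train = X[: n // 2], y[: n // 2]
X_test, y_test = X[n // 2:], y[n // 2:]
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def predict(X_new):
    return X_new @ beta


def sample_xj(X_cur, j, rng):
    # Independent N(0, 1) predictors: the conditional law is just N(0, 1).
    return rng.standard_normal(X_cur.shape[0])


p_signal = hrt_pvalue(predict, X_test, y_test, 0, sample_xj, rng=rng)
p_null = hrt_pvalue(predict, X_test, y_test, 1, sample_xj, rng=rng)
print("p-value, predictor 0:", p_signal)
print("p-value, predictor 1:", p_null)
```

The model is fit once and reused for every resample and every predictor, which is the trade-off the abstract attributes to the HRT: the cost sits in the resampling loop rather than in repeated fitting.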