Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power (for example, when $X$ is involved in complex interactions) or may lead to many false rejections. In this work, we study the problem of testing the model-free null of conditional mean independence, i.e., that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure uses these methods to perform regressions: first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves a rate that we prove is minimax optimal for this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach, in terms of both Type I error control and power, compared with several existing approaches.
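The two-stage, sample-splitting procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions that the abstract does not state: the function names (`poly_features`, `fit_regression`, `projected_covariance_test`) are hypothetical, a low-degree polynomial least-squares fit stands in for spline regression or random forests, the conditional covariance is taken given $Z$, the statistic is a studentized mean of residual products, and calibration is by a one-sided normal approximation. Any refinements in the full method are omitted.

```python
import math
import numpy as np


def poly_features(A, degree=3):
    """Intercept, per-column powers up to `degree`, and pairwise products.

    A crude stand-in for a spline basis (assumption, for illustration only).
    """
    A = np.asarray(A, dtype=float)
    if A.ndim == 1:
        A = A[:, None]
    cols = [np.ones(A.shape[0])]
    for d in range(1, degree + 1):
        cols.extend(A[:, j] ** d for j in range(A.shape[1]))
    for i in range(A.shape[1]):
        for j in range(i + 1, A.shape[1]):
            cols.append(A[:, i] * A[:, j])
    return np.column_stack(cols)


def fit_regression(A, y, degree=3):
    """Least-squares fit on the polynomial basis; returns a prediction function."""
    beta, *_ = np.linalg.lstsq(poly_features(A, degree), y, rcond=None)
    return lambda B: poly_features(B, degree) @ beta


def projected_covariance_test(X, Y, Z, seed=0):
    """Return (statistic, one-sided p-value) for H0: E[Y | X, Z] = E[Y | Z]."""
    X, Y, Z = (np.asarray(v, dtype=float) for v in (X, Y, Z))
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))
    a, b = perm[: len(Y) // 2], perm[len(Y) // 2:]
    XZ = np.column_stack([X, Z])

    # Half A: estimate the projection of Y on (X, Z), then remove the part
    # explained by Z alone, leaving a direction that can depend on X.
    g = fit_regression(XZ[a], Y[a])
    m = fit_regression(Z[a], g(XZ[a]))
    f_b = g(XZ[b]) - m(Z[b])

    # Half B: residualize both Y and the estimated projection on Z, then
    # average the residual products to estimate the expected conditional
    # covariance.
    res_y = Y[b] - fit_regression(Z[b], Y[b])(Z[b])
    res_f = f_b - fit_regression(Z[b], f_b)(Z[b])
    L = res_y * res_f

    stat = math.sqrt(len(b)) * L.mean() / L.std(ddof=1)
    p_value = 0.5 * math.erfc(stat / math.sqrt(2))  # upper tail, 1 - Phi(stat)
    return stat, p_value
```

Calling `projected_covariance_test(X, Y, Z)` with one-dimensional or column-stacked NumPy arrays returns the statistic and an approximate p-value. In practice, the polynomial fit would be replaced by whichever flexible regression method suits the data.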