In this paper we give a completely new approach to the problem of covariate selection in linear regression. A covariate or a set of covariates is included only if it is better, in the sense of least squares, than the same number of Gaussian covariates consisting of i.i.d. $N(0,1)$ random variables. The Gaussian P-value is defined as the probability that the Gaussian covariates are better. It is given in terms of the Beta distribution, is exact, and holds for all data. The covariate selection procedures based on it require only a cut-off value $\alpha$ for the Gaussian P-value; the default value in this paper is $\alpha=0.01$. The resulting procedures are very simple and very fast, do not overfit, and require only least squares. In particular, there is no regularization parameter, no data splitting, no use of simulations and no shrinkage, and no post-selection inference is required. The paper includes the results of simulations, applications to real data sets, and theorems on the asymptotic behaviour under the standard linear model. Here the stepwise procedure performs overwhelmingly better than any other procedure we are aware of. An R-package {\it gausscov} is available.
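To make the definition of the Gaussian P-value concrete, the following is a minimal sketch for the simplest case: one candidate covariate added to a model that already contains $k_0$ covariates (counting the intercept column). It is not the {\it gausscov} implementation. It relies on the fact that, for a single $N(0,1)$ covariate, the fraction of the residual sum of squares removed follows a Beta$(1/2,(n-k_0-1)/2)$ distribution. The function names, the Simpson-rule integration and the toy data are assumptions made purely for illustration.

```python
import math

def gaussian_p_value(rss_old, rss_new, n, k0, steps=2000):
    """Probability that one N(0,1) covariate reduces the RSS at least as much
    as the observed reduction rss_old -> rss_new (hypothetical helper).

    n  : sample size
    k0 : number of covariates already in the model (intercept included)
    steps : even number of Simpson intervals
    """
    b = (n - k0 - 1) / 2.0
    if b < 1.0:
        raise ValueError("this sketch assumes n - k0 >= 3")
    # Fraction of RSS removed is c**2; a Gaussian covariate removes a
    # Beta(1/2, b) distributed fraction. With u = sqrt(fraction),
    # P(fraction >= c**2) = 2/B(1/2, b) * integral_c^1 (1 - u^2)^(b-1) du.
    c = math.sqrt(max(0.0, 1.0 - rss_new / rss_old))
    log_beta = math.lgamma(0.5) + math.lgamma(b) - math.lgamma(b + 0.5)

    def f(u):
        return max(0.0, 1.0 - u * u) ** (b - 1.0)

    h = (1.0 - c) / steps
    total = f(c) + f(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(c + i * h)
    integral = total * h / 3.0  # composite Simpson rule
    return min(1.0, 2.0 * integral / math.exp(log_beta))

def rss_pair(x, y):
    """RSS of the intercept-only fit and of the fit with intercept plus x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return syy, syy - sxy * sxy / sxx

# Toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.4]
alpha = 0.01  # default cut-off stated in the abstract

rss_old, rss_new = rss_pair(x, y)
p = gaussian_p_value(rss_old, rss_new, n=len(y), k0=1)
print("Gaussian P-value:", p, "include covariate:", p < alpha)
```

The sketch covers a single selection decision only. The stepwise procedure of the paper additionally accounts for choosing the best among several candidate covariates, which is not reproduced here.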