An Approximation Based Theory of Linear Regression

The goal of this paper is to provide a theory of linear regression based entirely on approximations. It will be argued that the standard model-based theory of linear regression, whether frequentist or Bayesian, has failed, and that this failure is due to an 'assumed (revealed?) truth' (John Tukey) attitude to the models. This attitude is reflected in the language of statistical inference, which involves a concept of truth, for example in efficiency, consistency and hypothesis testing. The motivation behind this paper was to remove the word 'true' from the theory and practice of linear regression and to replace it by approximation. The approximations considered are the least squares approximations. An approximation is called valid if it contains no irrelevant covariates. This is operationalized using the concept of a Gaussian P-value, which is the probability that pure Gaussian noise is better in terms of least squares than the covariate. The precise definition given in the paper is intuitive and requires only four simple equations. Given this, a valid approximation is one where all the Gaussian P-values are less than a threshold $p_0$ specified by the statistician, in this paper with the default value 0.01. This approximation approach is not only much simpler, it is overwhelmingly better than the standard model-based approach. This will be demonstrated using six real data sets, four from high-dimensional regression and two from vector autoregression. Both the simplicity and the superiority of Gaussian P-values derive from their universal exactness and validity. This is in complete contrast to standard F P-values, which are valid only for carefully designed simulations. The paper contains excerpts from an unpublished paper by John Tukey entitled 'Issues relevant to an honest account of data-based inference partially in the light of Laurie Davies's paper'.
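The verbal definition of the Gaussian P-value above can be illustrated with a small Monte Carlo sketch. This is not the paper's four-equation definition, which is not reproduced in this abstract. It is only one plausible reading of the sentence "the probability that pure Gaussian noise is better in terms of least squares than the covariate": replace the covariate by i.i.d. standard Gaussian noise, refit by least squares, and count how often the noise yields a smaller residual sum of squares. The function names, the treatment of the intercept as an always-included column, and the toy data are all assumptions made for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residualize(v, basis):
    """Subtract from v its projections onto a list of mutually orthogonal vectors."""
    out = list(v)
    for b in basis:
        c = dot(out, b) / dot(b, b)
        out = [o - c * bi for o, bi in zip(out, b)]
    return out


def orthogonalize(columns):
    """Gram-Schmidt: turn the given columns into mutually orthogonal vectors."""
    basis = []
    for col in columns:
        r = residualize(col, basis)
        if dot(r, r) > 1e-12:  # skip columns that are (numerically) redundant
            basis.append(r)
    return basis


def rss_after_adding(y_res, new_col, basis):
    """Least squares RSS of y on the basis columns plus one extra column.

    y_res must already be residualized against the basis.
    """
    r = residualize(new_col, basis)
    return dot(y_res, y_res) - dot(y_res, r) ** 2 / dot(r, r)


def gaussian_p_value_mc(y, x, others, n_sim=2000, seed=0):
    """Monte Carlo estimate (illustrative, hypothetical helper) of the
    probability that a pure Gaussian noise covariate gives a smaller RSS
    than the covariate x, with the columns in `others` always included."""
    rng = random.Random(seed)
    basis = orthogonalize(others)
    y_res = residualize(y, basis)
    rss_x = rss_after_adding(y_res, x, basis)
    n = len(y)
    hits = 0
    for _ in range(n_sim):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if rss_after_adding(y_res, z, basis) < rss_x:
            hits += 1
    return hits / n_sim


# Toy data (made up for illustration): y depends almost linearly on x.
P0 = 0.01  # default threshold named in the abstract
x = [float(i) for i in range(1, 9)]
y = [2.0 * xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
intercept = [1.0] * len(y)

p = gaussian_p_value_mc(y, x, others=[intercept])
print("estimated Gaussian P-value:", p, "| below threshold:", p < P0)
```

Under this reading, a covariate is retained when its estimated P-value falls below $p_0$, and an approximation is valid when that holds for every covariate it contains. A Monte Carlo estimate is only approximate and cannot resolve very small probabilities, so it does not reproduce the exactness the abstract claims for the paper's own definition.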
