Regression analysis based on many covariates is becoming increasingly common. However, when the number of covariates $p$ is of the same order as the number of observations $n$, maximum likelihood (ML) regression becomes unreliable due to overfitting. This typically leads to systematic estimation biases and increased estimator variances. Quantifying these effects correctly is crucial for inference and prediction. Several methods have been proposed in the literature to overcome overfitting bias or to adjust estimates. The vast majority of these focus on the regression parameters. However, failing to also estimate the nuisance parameters correctly may lead to significant errors in confidence statements and outcome prediction. In this paper we present a jackknife method for deriving a compact set of non-linear equations that describe the statistical properties of the ML estimator in the regime where $p=O(n)$, under the assumption of normally distributed covariates. These equations enable one to compute the overfitting bias of ML estimators in parametric regression models as a function of $\zeta = p/n$. We then use these equations to compute shrinkage factors that remove this bias. Compared with the replica approach, this new derivation is more transparent and requires fewer assumptions. To illustrate the theory, we performed simulation studies for multiple regression models. In all cases we find excellent agreement between theory and simulations.
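The overfitting bias described above can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions, not the jackknife derivation or the equations of the paper: it assumes a logistic regression model with no intercept, independent standard normal covariates, and a signal strength $\mathrm{Var}(x^\top\beta) = 5$. The names `fit_logistic_ml` and `simulate_inflation` are hypothetical helpers. The sketch fits the ML estimator by damped Newton iterations and reports the empirical slope of the estimated coefficients on the true ones, which would be close to 1 for an unbiased estimator.

```python
import numpy as np


def fit_logistic_ml(X, y, max_iter=100, tol=1e-8):
    """Unpenalised ML logistic regression via Newton's method with step halving."""
    n, p = X.shape
    beta = np.zeros(p)

    def neg_log_lik(b):
        eta = X @ b
        # sum_i [log(1 + exp(eta_i)) - y_i * eta_i], computed stably
        return np.sum(np.logaddexp(0.0, eta) - y * eta)

    current = neg_log_lik(beta)
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(-np.logaddexp(0.0, -eta))   # stable sigmoid(eta)
        grad = X.T @ (mu - y)                   # gradient of the negative log-likelihood
        w = mu * (1.0 - mu)
        hess = X.T @ (X * w[:, None])           # Hessian of the negative log-likelihood
        step = np.linalg.solve(hess, grad)

        # Halve the step until the negative log-likelihood does not increase.
        t = 1.0
        candidate, value = beta - step, neg_log_lik(beta - step)
        while value > current and t > 1e-8:
            t *= 0.5
            candidate = beta - t * step
            value = neg_log_lik(candidate)

        beta, current = candidate, value
        if np.max(np.abs(t * step)) < tol:
            break
    return beta


def simulate_inflation(n, p, gamma2=5.0, seed=0):
    """Empirical slope of ML estimates on the true coefficients for zeta = p / n.

    Assumptions (not from the source): logistic model, i.i.d. N(0, 1) covariates,
    coefficients of equal magnitude with random signs, Var(x . beta) = gamma2.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta_true = np.sqrt(gamma2 / p) * rng.choice([-1.0, 1.0], size=p)
    prob = np.exp(-np.logaddexp(0.0, -(X @ beta_true)))
    y = (rng.random(n) < prob).astype(float)
    beta_hat = fit_logistic_ml(X, y)
    # Least-squares slope of beta_hat on beta_true (through the origin).
    return float(beta_hat @ beta_true / (beta_true @ beta_true))


if __name__ == "__main__":
    for n, p in [(2000, 20), (2000, 400)]:
        slope = simulate_inflation(n, p)
        print(f"zeta = {p / n:.2f}: empirical slope = {slope:.3f}")
```

In this setup the slope at $\zeta = 0.2$ should come out noticeably larger than the slope at $\zeta = 0.01$, which is the kind of systematic inflation a shrinkage factor is meant to undo. Here the slope can only be measured because the true coefficients are known. In practice the correction would have to come from theory, such as the equations referred to in the abstract.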