Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross-validation (GCV) estimator fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently computable unbiased estimator, which we dub CorrGCV, that concentrates in the high-dimensional limit. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high-dimensional data.
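For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of the classical GCV estimator for ridge regression, which presumes independent training samples. It is not CorrGCV and does not reproduce the correction derived in the paper. The function name `ridge_gcv` and the convention of scaling the ridge penalty by the sample size `n` are assumptions made for illustration.

```python
import numpy as np


def ridge_gcv(X, y, lam):
    """Classical GCV estimate of out-of-sample risk for ridge regression.

    Assumes independent training samples. This is the uncorrected
    estimator, not CorrGCV. The penalty scaling n * lam is an
    illustrative convention, not taken from the source.
    """
    n, p = X.shape
    # Regularized Gram matrix: X^T X + n * lam * I
    A = X.T @ X + n * lam * np.eye(p)
    # Ridge weights: w = A^{-1} X^T y
    w = np.linalg.solve(A, X.T @ y)
    # Effective degrees of freedom: tr(X A^{-1} X^T) = tr(A^{-1} X^T X)
    df = np.trace(np.linalg.solve(A, X.T @ X))
    # In-sample (training) mean squared error
    train_mse = np.mean((y - X @ w) ** 2)
    # GCV inflates the training error by (1 - df / n)^{-2}
    gcv = train_mse / (1.0 - df / n) ** 2
    return gcv, train_mse, w


# Synthetic example with independent Gaussian rows (hypothetical data)
rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ w_true + 0.5 * rng.standard_normal(n)

gcv, train_mse, _ = ridge_gcv(X, y, lam=0.1)
print(f"training MSE: {train_mse:.4f}, GCV risk estimate: {gcv:.4f}")
```

When the rows of `X` are correlated, as in the setting studied here, the abstract states that this classical estimate no longer predicts the out-of-sample risk correctly, which is the motivation for the modified estimator.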