We study principal components regression (PCR) in an asymptotic high-dimensional regression setting, where the number of data points is proportional to the dimension. We derive exact limiting formulas for the estimation and prediction risks, which depend in a complicated manner on the eigenvalues of the population covariance, the alignment between the population PCs and the true signal, and the number of selected PCs. A key challenge in the high-dimensional setting stems from the fact that the sample covariance is an inconsistent estimate of its population counterpart, so that sample PCs may fail to fully capture potential latent low-dimensional structure in the data. We demonstrate this point through several case studies, including that of a spiked covariance model. To calculate the asymptotic prediction risk, we leverage tools from random matrix theory that, to our knowledge, have not seen much use to date in the statistics literature: multi-resolvent traces and their associated eigenvector overlap measures.