We study principal components regression (PCR) in a high-dimensional asymptotic regression setting, where the number of data points is proportional to the dimension. We derive exact limiting formulas for the estimation and prediction risks, which depend in a complicated manner on the eigenvalues of the population covariance, the alignment between the population PCs and the true signal, and the number of selected PCs. A key challenge in the high-dimensional setting is that the sample covariance is an inconsistent estimate of its population counterpart, so the sample PCs may fail to fully capture latent low-dimensional structure in the data. We demonstrate this point through several case studies, including that of a spiked covariance model. To calculate the asymptotic prediction risk, we leverage tools from random matrix theory that, to our knowledge, have seen little use to date in the statistics literature: multi-resolvent traces and their associated eigenvector overlap measures.
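As a rough numerical illustration of the setting above, the following sketch fits PCR on data drawn from a single-spike covariance model and reports the estimation risk, the prediction risk, and the squared overlap between the top sample PC and the population spike direction. It is a finite-sample simulation under assumptions of our own choosing, not the limiting formulas derived in the paper: the function names `pcr_fit` and `simulate_spiked`, the Gaussian design, the placement of the spike on the first coordinate, the choice of a signal fully aligned with the spike, and all parameter values are hypothetical.

```python
import numpy as np


def pcr_fit(X, y, k):
    """PCR coefficients: regress y on the top-k sample PCs of X.

    Assumes the columns of X are already mean-zero (no centering is done).
    """
    # Right singular vectors of X are the eigenvectors of the sample covariance.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Least squares on the scores X V_k = U_k S_k, mapped back to the
    # original coordinates: beta_hat = V_k S_k^{-1} U_k^T y.
    return Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])


def simulate_spiked(n, p, spike, k, noise_sd=1.0, seed=0):
    """One draw from an illustrative single-spike model (our assumption).

    Population covariance: identity, except variance `spike` on coordinate 0.
    True signal: the unit vector along the spike direction.
    Returns (estimation risk, prediction risk, squared PC overlap).
    """
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[0] = 1.0
    Sigma = np.eye(p) + (spike - 1.0) * np.outer(v, v)

    # Gaussian design with covariance Sigma.
    X = rng.standard_normal((n, p))
    X[:, 0] *= np.sqrt(spike)

    beta = v
    y = X @ beta + noise_sd * rng.standard_normal(n)

    beta_hat = pcr_fit(X, y, k)
    diff = beta_hat - beta
    est_risk = diff @ diff            # squared error of the coefficients
    pred_risk = diff @ Sigma @ diff   # excess risk at a fresh test point

    # Squared inner product between the top sample PC and the spike direction.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    overlap = (Vt[0] @ v) ** 2
    return est_risk, pred_risk, overlap


if __name__ == "__main__":
    # Dimension proportional to sample size (p / n = 0.5 here).
    for spike in (1.2, 5.0, 50.0):
        est, pred, ov = simulate_spiked(n=200, p=100, spike=spike, k=1)
        print(f"spike={spike:5.1f}  est={est:.3f}  pred={pred:.3f}  overlap={ov:.3f}")
```

Varying `spike` and `k` in the loop gives a sense of how the alignment of the sample PCs with the population direction, and hence the risk of PCR, changes with the strength of the low-dimensional structure when `p` is comparable to `n`.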