Spectral methods have myriad applications in high-dimensional statistics and data science. Previous works have primarily focused on $\ell_2$ or $\ell_{2,\infty}$ eigenvector and singular vector perturbation theory, but in many settings these analyses fall short of the fine-grained guarantees required for inferential tasks. In this paper we study statistical inference for linear functions of eigenvectors and principal components, with particular emphasis on the setting where gaps between eigenvalues may be extremely small relative to the corresponding spiked eigenvalue, a regime that has often been neglected in the literature. It has previously been established that linear functions of eigenvectors and principal components incur a non-negligible bias. We therefore provide Berry-Esseen bounds for empirical linear forms and their debiased counterparts, respectively, in the matrix denoising model and the spiked principal component analysis model, both under Gaussian noise. Next, we propose data-driven estimators for the relevant bias and variance quantities, which yield approximately valid confidence intervals, and we illustrate our theoretical results through numerical simulations. We further apply our results to obtain distributional theory and confidence intervals for eigenvector entries, for which debiasing is not necessary. Crucially, our proposed confidence intervals and bias-correction procedures can all be computed directly from data without sample-splitting and are asymptotically valid under minimal assumptions on the eigengap and signal strength. Furthermore, our Berry-Esseen bounds explicitly reflect the effects of both signal strength and eigenvalue closeness on the estimation and inference tasks.
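The bias of empirical linear forms mentioned in the abstract can be seen in a small simulation. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: a rank-one signal $\lambda u u^\top$, symmetric Gaussian noise with off-diagonal variance $\sigma^2$, and the test vector $a = u$. The function name `simulate_linear_form` and all parameter values are hypothetical. In this rank-one setting the classical spiked-Wigner asymptotics suggest that $a^\top \hat{u}$ concentrates near $\sqrt{1 - n\sigma^2/\lambda^2}$ rather than near $a^\top u = 1$; this is not the paper's bias-correction procedure.

```python
import numpy as np


def simulate_linear_form(n=200, lam=60.0, sigma=1.0, trials=20, seed=0):
    """Monte Carlo draws of a^T u_hat in a rank-one matrix denoising model.

    Assumed setup (illustrative only): M = lam * u u^T + W, where W is a
    symmetric Gaussian matrix whose off-diagonal entries have variance sigma^2.
    """
    rng = np.random.default_rng(seed)
    u = np.ones(n) / np.sqrt(n)  # true unit-norm eigenvector
    a = u.copy()                 # test direction; here a = u, so a^T u = 1
    values = np.empty(trials)
    for t in range(trials):
        g = rng.normal(scale=sigma, size=(n, n))
        w = (g + g.T) / np.sqrt(2.0)          # symmetrize the noise
        m = lam * np.outer(u, u) + w          # observed noisy matrix
        _, eigvecs = np.linalg.eigh(m)        # eigenvalues in ascending order
        u_hat = eigvecs[:, -1]                # leading empirical eigenvector
        if u_hat @ u < 0:                     # resolve the sign ambiguity
            u_hat = -u_hat
        values[t] = a @ u_hat
    return values


if __name__ == "__main__":
    n, lam, sigma = 200, 60.0, 1.0
    vals = simulate_linear_form(n=n, lam=lam, sigma=sigma)
    shrinkage = np.sqrt(1.0 - n * sigma**2 / lam**2)  # asymptotic prediction
    print("mean of a^T u_hat:   ", vals.mean())
    print("predicted shrinkage: ", shrinkage)
    print("true value a^T u:     1.0")
```

The empirical mean falls visibly below the true value of 1, which is the kind of systematic shrinkage that motivates debiasing before building confidence intervals for general linear forms.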