This paper studies how to construct confidence regions for principal component analysis (PCA) in high dimension, a problem that remains vastly under-explored. Computing measures of uncertainty for nonlinear/nonconvex estimators is in general difficult in high dimension, and the challenge is further compounded by the prevalence of missing data and heteroskedastic noise. We propose a novel approach to valid inference on the principal subspace under a spiked covariance model with missing data, built on an estimator called HeteroPCA (Zhang et al., 2022). We develop non-asymptotic distributional guarantees for HeteroPCA and show how they can be used to compute both confidence regions for the principal subspace and entrywise confidence intervals for the spiked covariance matrix. Our inference procedures are fully data-driven and adaptive to heteroskedastic random noise, and they require no prior knowledge of the noise levels.