What property of the data distribution determines the excess risk of principal component analysis (PCA)? In this paper, we provide a precise answer to this question. We establish a central limit theorem for the error of the principal subspace estimated by PCA, and derive the asymptotic distribution of its excess risk under the reconstruction loss. We also obtain a non-asymptotic upper bound on the excess risk of PCA that recovers, in the large-sample limit, our asymptotic characterization. Underlying our contributions is the following result: we prove that the negative block Rayleigh quotient, defined on the Grassmannian, is generalized self-concordant along geodesics of maximum rotation less than $\pi/4$ emanating from its minimizer.
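To make the central quantity concrete, the following is a minimal sketch, not the paper's method, of the excess risk of PCA under the reconstruction loss. It assumes zero-mean data, so that the population reconstruction risk of a $k$-dimensional subspace with orthonormal basis $U$ is $\operatorname{tr}(\Sigma) - \operatorname{tr}(U^\top \Sigma U)$, that is, the negative block Rayleigh quotient up to an additive constant. The covariance `Sigma`, the Gaussian sampling, the sample sizes, and all function names are illustrative assumptions.

```python
import numpy as np


def reconstruction_risk(U, Sigma):
    """Population risk E||x - U U^T x||^2 for zero-mean x with covariance Sigma.

    U is a d x k matrix with orthonormal columns.
    """
    return np.trace(Sigma) - np.trace(U.T @ Sigma @ U)


def pca_subspace(X, k):
    """Orthonormal basis (d x k) of the top-k eigenspace of the empirical
    second-moment matrix (no centering, since zero mean is assumed)."""
    S = X.T @ X / X.shape[0]
    _, eigvecs = np.linalg.eigh(S)  # eigenvalues returned in ascending order
    return eigvecs[:, -k:]


def excess_risk(U_hat, Sigma, k):
    """Risk of U_hat minus the smallest risk over all k-dimensional subspaces."""
    eigvals = np.linalg.eigvalsh(Sigma)  # ascending order
    optimal_risk = np.trace(Sigma) - eigvals[-k:].sum()
    return reconstruction_risk(U_hat, Sigma) - optimal_risk


# Illustrative setup (assumed): Gaussian data with a diagonal covariance.
rng = np.random.default_rng(0)
d, k = 5, 2
Sigma = np.diag([5.0, 4.0, 1.0, 0.5, 0.25])

for n in (100, 10_000):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    U_hat = pca_subspace(X, k)
    print(n, excess_risk(U_hat, Sigma, k))
```

The excess risk computed this way is nonnegative and vanishes exactly when `U_hat` spans a top-$k$ eigenspace of `Sigma`.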