Principal component analysis (PCA) is a simple and popular tool for processing high-dimensional data. We investigate its effectiveness for matrix denoising. We assume that the clean data are generated from a low-dimensional subspace but are corrupted by independent high-dimensional sub-Gaussian noise with standard deviation $\sigma$. Under a low-rank assumption on the clean data and a mild spectral gap assumption, we prove that the distance between each PCA-denoised data point and its corresponding clean data point is uniformly bounded by $O(\sigma \log n)$. To illustrate the spectral gap assumption, we show that it is satisfied when the clean data are generated independently with a non-degenerate covariance matrix. We then provide a general lower bound on the error of the denoised data matrix, which indicates that the uniform error bound achieved by PCA denoising is rate-optimal. Furthermore, we examine how the error bound affects downstream applications such as clustering and manifold learning. Numerical results validate our theoretical findings and reveal the importance of the uniform error bound.
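As a rough illustration of the setting described above, the following minimal sketch simulates low-rank clean data corrupted by Gaussian noise (one instance of sub-Gaussian noise) and denoises it by rank-$r$ truncated SVD. The function name `pca_denoise`, the dimensions, the signal scale, and the choice to skip centering are all assumptions made for this example and are not taken from the paper. It compares the worst-case per-point error before and after denoising, which is the quantity that a uniform bound controls.

```python
import numpy as np


def pca_denoise(Y, r):
    """Project the rows of Y onto the span of its top-r right singular vectors.

    Hypothetical helper for illustration: uncentered truncated SVD,
    which may differ from the exact estimator analyzed in the paper.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # Keep only the r leading singular triplets (best rank-r approximation).
    return (U[:, :r] * s[:r]) @ Vt[:r]


rng = np.random.default_rng(0)
n, p, r, sigma = 1000, 200, 3, 1.0  # illustrative sizes (assumptions)

# Clean data: n points lying in a random r-dimensional subspace of R^p.
basis, _ = np.linalg.qr(rng.standard_normal((p, r)))  # p x r, orthonormal columns
coeffs = 50.0 * rng.standard_normal((n, r))           # assumed signal scale
X = coeffs @ basis.T

# Noisy observations: independent Gaussian noise with standard deviation sigma.
Y = X + sigma * rng.standard_normal((n, p))

X_hat = pca_denoise(Y, r)

# Uniform (worst-case over data points) Euclidean error, before and after denoising.
raw_err = np.linalg.norm(Y - X, axis=1).max()
pca_err = np.linalg.norm(X_hat - X, axis=1).max()
print(f"max per-point error, noisy data:   {raw_err:.3f}")
print(f"max per-point error, PCA-denoised: {pca_err:.3f}")
```

In this simulation the noisy points sit at distance on the order of $\sigma\sqrt{p}$ from their clean counterparts, so the maximum per-point error after denoising gives a direct check on how much the projection helps uniformly, rather than only on average.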