We analyze a practical algorithm for sparse PCA on incomplete and noisy data under a general non-random sampling scheme. The algorithm is based on a semidefinite relaxation of the $\ell_1$-regularized PCA problem. We show that, under certain conditions, the relaxation has a unique solution that recovers the support of the sparse leading eigenvector with high probability. The conditions involve the spectral gap between the largest and second-largest eigenvalues of the true data matrix, the magnitude of the noise, and the structural properties of the observed entries, which we describe using the concepts of algebraic connectivity and irregularity. We support our theorem empirically with analyses of synthetic and real data. We also show that our algorithm outperforms several other sparse PCA approaches, especially when the observed entries have good structural properties. As a by-product of our analysis, we provide two theorems for handling deterministic sampling schemes, which can be applied to other matrix-related problems.
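To make the notion of algebraic connectivity concrete, here is a minimal sketch that computes it for an observation pattern. It assumes the observed entries of a symmetric $d \times d$ matrix are encoded as a boolean mask and treated as the edges of an undirected graph on $d$ nodes; this encoding, the function name `algebraic_connectivity`, and the example masks are illustrative assumptions, not the paper's notation. It uses the standard graph-theoretic definition: the second-smallest eigenvalue of the graph Laplacian.

```python
import numpy as np


def algebraic_connectivity(mask):
    """Second-smallest Laplacian eigenvalue of the observation graph.

    `mask[i, j]` is truthy when entry (i, j) is observed. Diagonal
    entries are ignored, since self-loops do not affect the Laplacian.
    (Hypothetical helper; the mask encoding is an assumption.)
    """
    adj = np.asarray(mask, dtype=float)
    adj = np.maximum(adj, adj.T)      # symmetrize; returns a new array
    np.fill_diagonal(adj, 0.0)        # drop self-loops
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigenvalues = np.linalg.eigvalsh(laplacian)  # ascending order
    return float(eigenvalues[1])


d = 4

# Every off-diagonal entry observed: the complete graph on 4 nodes.
complete = np.ones((d, d), dtype=bool)

# Only entries adjacent to the diagonal observed: a path graph.
path = np.zeros((d, d), dtype=bool)
for i in range(d - 1):
    path[i, i + 1] = path[i + 1, i] = True

# Two blocks with no observed entries linking them: a disconnected graph.
disconnected = np.zeros((d, d), dtype=bool)
disconnected[0, 1] = disconnected[1, 0] = True
disconnected[2, 3] = disconnected[3, 2] = True

for name, mask in [("complete", complete), ("path", path),
                   ("disconnected", disconnected)]:
    print(name, round(algebraic_connectivity(mask), 4))
```

A larger value indicates a better-connected observation pattern, and a value of zero means the pattern splits into components with no observed entries between them. How this quantity enters the recovery conditions, together with irregularity, is specified by the theorems rather than by this sketch.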