The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, to which we have only black-box access via samples. Motivated by these settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle that returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly understood. Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show that deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving on prior work in sample complexity by a $\mathsf{poly}(k)$ factor.
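To make the deflation framework concrete, the following is a minimal sketch of the generic reduction: call a $1$-PCA oracle $k$ times, each time on the matrix projected onto the orthogonal complement of the vectors found so far. It is an illustration under stated assumptions, not the algorithm analyzed in the paper. The names `one_pca_oracle` and `deflation_k_pca` are hypothetical, the oracle is stood in for by plain power iteration on an explicitly known matrix (in the statistical setting the matrix would be accessible only through samples), and the matrix is assumed symmetric positive semidefinite.

```python
import numpy as np


def one_pca_oracle(M, iters=500, seed=0):
    """Hypothetical 1-PCA oracle: approximate top eigenvector of a PSD matrix.

    Assumption: plain power iteration stands in for the black box.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = M @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:  # v lies in the null space of M; nothing to amplify
            break
        v = w / norm
    return v


def deflation_k_pca(M, k, oracle=one_pca_oracle):
    """Black-box deflation: k oracle calls, each on the deflated matrix P M P."""
    d = M.shape[0]
    P = np.eye(d)  # projector onto the complement of the vectors found so far
    found = []
    for _ in range(k):
        u = oracle(P @ M @ P)
        u = P @ u  # re-project so the new vector is orthogonal to earlier ones
        u /= np.linalg.norm(u)
        found.append(u)
        P = P - np.outer(u, u)  # deflate: remove the new direction
    return np.column_stack(found)  # d x k matrix with orthonormal columns
```

One natural way to judge the output $U$ in a quadratic-form sense is to compare the captured energy $\mathrm{Tr}(U^\top M U)$ with the sum of the top $k$ eigenvalues of $M$. This comparison is used here only as an illustrative quality measure; the precise ePCA and cPCA definitions are those given in the paper.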