The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely used in data analysis and dimensionality reduction. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, to which we have only implicit access through samples. Motivated by these implicit settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms. We model access to the unknown target matrix via a black-box $1$-PCA oracle that returns an approximate top eigenvector, under two popular notions of approximation. Although they are arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly understood. Our main contribution is a set of significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show that deflation methods suffer no parameter loss. For an alternative, well-studied notion of approximation we term cPCA (correlation PCA), we tightly characterize the parameter regimes in which deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms that are robust to dataset contamination, improving on prior work in both sample complexity and approximation quality.
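As an illustration of the deflation template described above, the following is a minimal sketch rather than the authors' algorithm. It assumes explicit access to a positive semidefinite matrix `M` and uses plain power iteration as a stand-in for the black-box $1$-PCA oracle; the abstract does not specify the oracle, and the names `one_pca_oracle` and `deflation_k_pca` are hypothetical. Each round calls the oracle on the matrix projected onto the orthogonal complement of the directions found so far, then adds the returned vector to the basis.

```python
import numpy as np


def one_pca_oracle(M, iters=500, seed=0):
    """Hypothetical 1-PCA oracle: power iteration on a PSD matrix M.

    Stand-in only; any routine returning an approximate top
    eigenvector could be plugged in instead.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = M @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:  # M annihilates v; keep the current unit vector
            break
        v = w / norm
    return v


def deflation_k_pca(M, k, oracle=one_pca_oracle):
    """Call the 1-PCA oracle k times, deflating found directions each round."""
    d = M.shape[0]
    P = np.eye(d)          # projector onto the complement of span(U)
    U = np.zeros((d, 0))   # orthonormal directions found so far
    for _ in range(k):
        u = oracle(P @ M @ P)      # oracle sees only the deflated matrix
        u = P @ u                  # guard against drift into span(U)
        u /= np.linalg.norm(u)
        U = np.column_stack([U, u])
        P = P - np.outer(u, u)     # deflate the new direction
    return U


if __name__ == "__main__":
    # Assumed toy input: a random 5x5 PSD matrix with a known spectrum.
    rng = np.random.default_rng(1)
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
    M = Q @ np.diag([10.0, 5.0, 1.0, 0.5, 0.1]) @ Q.T
    U = deflation_k_pca(M, k=2)
    # Fraction of the best possible rank-2 quadratic form that U captures.
    print(np.trace(U.T @ M @ U) / 15.0)
```

The printed ratio compares the quadratic form captured by `U` with the sum of the top two eigenvalues, in the spirit of the energy-based (ePCA) notion. In the statistical setting the oracle would work from samples instead of an explicit matrix.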