The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, to which we have only implicit access via samples. Motivated by these implicit settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle that returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly understood. Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show that deflation methods suffer no parameter loss. For an alternative, well-studied notion of approximation we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving on prior work in both sample complexity and approximation quality.
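The deflation scheme described above (call a $1$-PCA oracle, project out the returned direction, and repeat $k$ times) can be illustrated with a minimal sketch. This is not the paper's algorithm or analysis: the function names `power_iteration_oracle` and `deflation_k_pca` are hypothetical, the oracle is plain power iteration standing in for an arbitrary black-box $1$-PCA routine, and the target matrix is assumed to be given explicitly as a symmetric positive semidefinite NumPy array rather than accessed through samples.

```python
import numpy as np


def power_iteration_oracle(M, iters=500, seed=0):
    """Hypothetical stand-in for a black-box 1-PCA oracle.

    Returns a unit vector approximating a top eigenvector of the
    symmetric PSD matrix M, using plain power iteration.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = M @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:
            # M annihilates v; keep the current unit vector.
            break
        v = w / norm
    return v


def deflation_k_pca(M, k, oracle=power_iteration_oracle):
    """Sketch of black-box deflation: k recursive calls to a 1-PCA oracle.

    Each round restricts M to the orthogonal complement of the directions
    found so far, asks the oracle for an approximate top eigenvector of
    the restricted matrix, and appends it to the basis.
    """
    d = M.shape[0]
    U = np.zeros((d, 0))  # orthonormal columns found so far
    for _ in range(k):
        P = np.eye(d) - U @ U.T  # projector onto the complement of span(U)
        u = oracle(P @ M @ P)    # black-box call on the deflated matrix
        u = P @ u                # remove any residual component in span(U)
        u /= np.linalg.norm(u)
        U = np.column_stack([U, u])
    return U
```

As a usage example, `deflation_k_pca(np.diag([5.0, 4.0, 3.0, 2.0, 1.0]), k=2)` returns a $5 \times 2$ matrix $U$ with orthonormal columns. One natural way to assess the output in the quadratic form sense is to compare the captured energy $\mathrm{Tr}(U^\top M U)$ with the sum of the top $k$ eigenvalues of $M$; the precise ePCA and cPCA definitions are given in the paper body and are not reproduced here.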