Sparse Principal Component Analysis (Sparse PCA) is a pivotal tool in data analysis and dimensionality reduction. However, Sparse PCA is a challenging problem in both theory and practice: it is known to be NP-hard, and current exact methods generally require exponential runtime. In this paper, we propose a novel framework to efficiently approximate Sparse PCA by (i) approximating the general input covariance matrix with a re-sorted block-diagonal matrix, (ii) solving the Sparse PCA sub-problem in each block, and (iii) reconstructing the solution to the original problem. Our framework is simple and powerful: it can leverage any off-the-shelf Sparse PCA algorithm and achieve significant computational speedups, with a minor additive error that is linear in the approximation error of the block-diagonal matrix. Suppose $g(k, d)$ is the runtime of an algorithm that (approximately) solves Sparse PCA in dimension $d$ with sparsity constant $k$. Our framework, when integrated with this algorithm, reduces the runtime to $\mathcal{O}\left(\frac{d}{d^\star} \cdot g(k, d^\star) + d^2\right)$, where $d^\star \leq d$ is the largest block size of the block-diagonal matrix. For instance, integrating our framework with the Branch-and-Bound algorithm reduces the complexity from $g(k, d) = \mathcal{O}(k^3\cdot d^k)$ to $\mathcal{O}(k^3\cdot d \cdot (d^\star)^{k-1})$, an exponential speedup when $d^\star$ is small. We perform large-scale evaluations on many real-world datasets: when integrated with an exact Sparse PCA algorithm, our method achieves an average speedup factor of 100.50 while maintaining an average approximation error of 0.61%; when integrated with an approximate Sparse PCA algorithm, our method achieves an average speedup factor of 6.00 and an average approximation error of -0.91%, meaning that our method often finds better solutions.
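As a quick check of the Branch-and-Bound figure, substituting $g(k, d^\star) = \mathcal{O}(k^3 \cdot (d^\star)^k)$ into the general bound gives

$$\frac{d}{d^\star} \cdot k^3 \cdot (d^\star)^{k} + d^2 = k^3 \cdot d \cdot (d^\star)^{k-1} + d^2,$$

so, setting aside the additive $d^2$ term, the cost relative to the original $k^3 \cdot d^k$ shrinks by a factor of $(d^\star/d)^{k-1}$. With illustrative values that are not taken from the experiments, say $d = 1000$, $d^\star = 10$, and $k = 5$, this factor is $(10/1000)^{4} = 10^{-8}$.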
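Steps (i) to (iii) can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: blocks are obtained by zeroing entries with magnitude at most a hypothetical threshold `eps` and taking connected components of what remains, each sub-problem is solved on the original entries restricted to its block, and a brute-force enumeration (`brute_force_spca`, a stand-in written for this example) plays the role of the off-the-shelf solver. All function names here are illustrative.

```python
import itertools
import numpy as np


def find_blocks(A, eps):
    """Step (i): indices i and j share a block if a chain of entries with
    |A[i, j]| > eps links them (connected components of the support graph)."""
    adjacent = np.abs(A) > eps
    unseen = set(range(A.shape[0]))
    blocks = []
    while unseen:
        stack = [unseen.pop()]
        block = []
        while stack:
            i = stack.pop()
            block.append(i)
            for j in np.flatnonzero(adjacent[i]):
                j = int(j)
                if j in unseen:
                    unseen.remove(j)
                    stack.append(j)
        blocks.append(sorted(block))
    return blocks


def brute_force_spca(A, k):
    """Stand-in exact solver: enumerate supports of size min(k, d) and keep
    the one whose principal submatrix has the largest top eigenvalue."""
    d = A.shape[0]
    best_val, best_x = -np.inf, None
    for support in itertools.combinations(range(d), min(k, d)):
        idx = list(support)
        vals, vecs = np.linalg.eigh(A[np.ix_(idx, idx)])  # ascending order
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(d)
            best_x[idx] = vecs[:, -1]
    return best_val, best_x


def block_diagonal_spca(A, k, eps, solver=brute_force_spca):
    """Steps (ii) and (iii): solve each block, keep the best block solution,
    and pad it with zeros back to the original dimension."""
    blocks = find_blocks(A, eps)
    best_val, best_x = -np.inf, None
    for block in blocks:
        val, x_block = solver(A[np.ix_(block, block)], k)
        if val > best_val:
            best_val = val
            best_x = np.zeros(A.shape[0])
            best_x[block] = x_block
    return best_x, blocks
```

Calling `block_diagonal_spca(A, k, eps)` on a symmetric matrix `A` returns a unit vector with at most `k` nonzero entries together with the detected blocks. Because the solver only ever sees matrices of size at most $d^\star$, the enumeration cost depends on $d^\star$ rather than $d$, which is where the speedup in this sketch comes from.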