Efficient Sparse PCA via Block-Diagonalization

Sparse Principal Component Analysis (Sparse PCA) is a pivotal tool in data analysis and dimensionality reduction. However, Sparse PCA is challenging in both theory and practice: it is known to be NP-hard, and current exact methods generally require exponential runtime. In this paper, we propose a novel framework to efficiently approximate Sparse PCA by (i) approximating the general input covariance matrix with a re-sorted block-diagonal matrix, (ii) solving the Sparse PCA sub-problem in each block, and (iii) reconstructing the solution to the original problem. Our framework is simple and powerful: it can leverage any off-the-shelf Sparse PCA algorithm and achieve significant computational speedups, at the cost of a minor additive error that is linear in the approximation error of the block-diagonal matrix. Suppose $g(k, d)$ is the runtime of an algorithm that (approximately) solves Sparse PCA in dimension $d$ with sparsity value $k$. Our framework, when integrated with this algorithm, reduces the runtime to $\mathcal{O}\left(\frac{d}{d^\star} \cdot g(k, d^\star) + d^2\right)$, where $d^\star \leq d$ is the largest block size of the block-diagonal matrix. For instance, integrating our framework with the Branch-and-Bound algorithm reduces the complexity from $g(k, d) = \mathcal{O}(k^3\cdot d^k)$ to $\mathcal{O}(k^3\cdot d \cdot (d^\star)^{k-1})$, an exponential speedup when $d^\star$ is small. We perform large-scale evaluations on many real-world datasets: when paired with an exact Sparse PCA algorithm, our method achieves an average speedup factor of 93.77 while maintaining an average approximation error of 2.15%; when paired with an approximate Sparse PCA algorithm, it achieves an average speedup factor of 6.77 and an average approximation error of merely 0.37%.
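The three steps (i) to (iii) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `find_blocks`, `brute_force_spca`, and `block_spca` are hypothetical, the thresholding rule with parameter `eps` is an assumed way to obtain a block-diagonal approximation, and a brute-force enumeration stands in for whichever off-the-shelf Sparse PCA solver would be plugged in. Sparse PCA is taken here to mean maximizing $x^\top A x$ subject to $\|x\|_2 = 1$ and at most $k$ nonzero entries in $x$.

```python
from itertools import combinations

import numpy as np


def find_blocks(A, eps):
    """Step (i), assumed rule: indices i and j share a block when a chain of
    entries with |A[i, j]| > eps links them (connected components)."""
    unseen = set(range(A.shape[0]))
    blocks = []
    while unseen:
        stack = [unseen.pop()]
        block = []
        while stack:
            i = stack.pop()
            block.append(i)
            nbrs = [j for j in unseen if abs(A[i, j]) > eps]
            unseen.difference_update(nbrs)
            stack.extend(nbrs)
        blocks.append(sorted(block))
    return blocks


def brute_force_spca(A, k):
    """Stand-in exact solver: enumerate every support of size min(k, d) and
    keep the one whose principal submatrix has the largest top eigenvalue."""
    d = A.shape[0]
    best_val, best_x = -np.inf, None
    for support in combinations(range(d), min(k, d)):
        idx = list(support)
        vals, vecs = np.linalg.eigh(A[np.ix_(idx, idx)])  # ascending order
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(d)
            best_x[idx] = vecs[:, -1]
    return best_val, best_x


def block_spca(A, k, eps, solver=brute_force_spca):
    """Steps (ii) and (iii): solve each block separately, then embed the best
    block solution back into the original d-dimensional space."""
    d = A.shape[0]
    best_val, best_x = -np.inf, None
    for block in find_blocks(A, eps):
        val, x_block = solver(A[np.ix_(block, block)], k)
        if val > best_val:
            best_val = val
            best_x = np.zeros(d)
            best_x[block] = x_block
    return best_val, best_x
```

In this sketch the block search scans each row once against the remaining indices, which is consistent with the $d^2$ term in the stated runtime, while the solver only ever sees matrices of size at most $d^\star$. Calling `block_spca(A, k, eps)` on a covariance matrix `A` returns an objective value and a $k$-sparse unit vector; a larger `eps` tends to give smaller blocks and a coarser approximation.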

