Sparse PCA Beyond Covariance Thresholding

In the Wishart model for sparse PCA we are given $n$ samples $Y_1,\ldots, Y_n$ drawn independently from a $d$-dimensional Gaussian distribution $N(0, \mathrm{Id} + \beta vv^\top)$, where $\beta > 0$ and $v\in \mathbb{R}^d$ is a $k$-sparse unit vector, and we wish to recover $v$ (up to sign). We show that if $n \ge \Omega(d)$, then for every $t \ll k$ there exists an algorithm running in time $n\cdot d^{O(t)}$ that solves this problem as long as \[ \beta \gtrsim \frac{k}{\sqrt{nt}}\sqrt{\ln(2 + td/k^2)}\,. \] Prior to this work, the best polynomial-time algorithm in the regime $k\approx \sqrt{d}$, called \emph{Covariance Thresholding} (proposed in [KNV15a] and analyzed in [DM14]), required $\beta \gtrsim \frac{k}{\sqrt{n}}\sqrt{\ln(2 + d/k^2)}$. For a large enough constant $t$, our algorithm runs in polynomial time and has better guarantees than Covariance Thresholding. Previously known algorithms with such guarantees required quasi-polynomial time $d^{O(\log d)}$. In addition, we show that our techniques apply to sparse PCA with adversarial perturbations, as studied in [dKNS20]. This model generalizes not only sparse PCA but also other problems studied in prior works, including the sparse planted vector problem. As a consequence, we provide polynomial-time algorithms for the sparse planted vector problem that have better guarantees than the state of the art in some regimes. Our approach also works in the Wigner model for sparse PCA. Moreover, we show that it is possible to combine our techniques with recent results on sparse PCA with symmetric heavy-tailed noise [dNNS22]. In particular, in the regime $k \approx \sqrt{d}$ we get the first polynomial-time algorithm that works with symmetric heavy-tailed noise, whereas the algorithm from [dNNS22] requires quasi-polynomial time in this setting.
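To make the setup concrete, the following is a minimal Python sketch of the Wishart model and of the two signal-strength expressions quoted above. It is an illustration only, not the algorithm from the paper. The function names are hypothetical, the planted vector is assumed to have entries $\pm 1/\sqrt{k}$ on a random support (the abstract only requires a $k$-sparse unit vector), and the constants hidden by $\gtrsim$ are assumed to be 1.

```python
import math
import random


def sample_sparse_unit_vector(d, k, rng):
    """Return a k-sparse unit vector in R^d.

    Assumption: entries are +-1/sqrt(k) on a uniformly random support;
    the abstract only asks for some k-sparse unit vector.
    """
    v = [0.0] * d
    for i in rng.sample(range(d), k):
        v[i] = rng.choice((-1.0, 1.0)) / math.sqrt(k)
    return v


def sample_wishart_model(n, d, v, beta, rng):
    """Draw n independent samples from N(0, Id + beta * v v^T).

    Each sample is g + sqrt(beta) * u * v with g ~ N(0, Id) and
    u ~ N(0, 1) independent, which has covariance Id + beta * v v^T.
    """
    samples = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        samples.append(
            [rng.gauss(0.0, 1.0) + math.sqrt(beta) * u * v_j for v_j in v]
        )
    return samples


def covariance_thresholding_bound(n, d, k):
    """(k / sqrt(n)) * sqrt(ln(2 + d / k^2)), hidden constant assumed 1."""
    return k / math.sqrt(n) * math.sqrt(math.log(2 + d / k**2))


def new_bound(n, d, k, t):
    """(k / sqrt(n t)) * sqrt(ln(2 + t d / k^2)), hidden constant assumed 1."""
    return k / math.sqrt(n * t) * math.sqrt(math.log(2 + t * d / k**2))


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, k = 200, 100, 10          # k is about sqrt(d)
    v = sample_sparse_unit_vector(d, k, rng)
    Y = sample_wishart_model(n, d, v, beta=2.0, rng=rng)
    print(len(Y), len(Y[0]))
    for t in (1, 2, 4, 8):
        print(t, new_bound(n, d, k, t), covariance_thresholding_bound(n, d, k))
```

At $t = 1$ the two expressions coincide. When $d/k^2 = 1$, their ratio is $\sqrt{\ln(2+t)/(t\ln 3)}$, which decreases as $t$ grows, matching the claim that a larger constant $t$ tolerates a smaller $\beta$ at the cost of a $d^{O(t)}$ running time.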


