In the Wishart model for sparse PCA we are given $n$ samples $Y_1,\ldots, Y_n$ drawn independently from a $d$-dimensional Gaussian distribution $N(0, \mathrm{Id} + \beta vv^\top)$, where $\beta > 0$ and $v\in \mathbb{R}^d$ is a $k$-sparse unit vector, and we wish to recover $v$ (up to sign). We show that if $n \ge \Omega(d)$, then for every $t \ll k$ there exists an algorithm running in time $n\cdot d^{O(t)}$ that solves this problem as long as \[ \beta \gtrsim \frac{k}{\sqrt{nt}}\sqrt{\ln(2 + td/k^2)}\,. \] Prior to this work, the best polynomial time algorithm in the regime $k\approx \sqrt{d}$, called \emph{Covariance Thresholding} (proposed in [KNV15a] and analyzed in [DM14]), required $\beta \gtrsim \frac{k}{\sqrt{n}}\sqrt{\ln(2 + d/k^2)}$. For a large enough constant $t$, our algorithm runs in polynomial time and has better guarantees than Covariance Thresholding. Previously known algorithms with such guarantees required quasi-polynomial time $d^{O(\log d)}$. In addition, we show that our techniques apply to sparse PCA with adversarial perturbations, studied in [dKNS20]. This model generalizes not only sparse PCA, but also other problems studied in prior works, including the sparse planted vector problem. As a consequence, we provide polynomial time algorithms for the sparse planted vector problem that have better guarantees than the state of the art in some regimes. Our approach also applies to the Wigner model for sparse PCA. Moreover, we show that it is possible to combine our techniques with recent results on sparse PCA with symmetric heavy-tailed noise [dNNS22]. In particular, in the regime $k \approx \sqrt{d}$ we get the first polynomial time algorithm that works with symmetric heavy-tailed noise, while the algorithm from [dNNS22] requires quasi-polynomial time in this setting.
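To make the setup concrete, the following is a minimal Python sketch of the Wishart model and of the right-hand side of the bound on $\beta$. It is not the algorithm from this work. The function names, the choice of a flat $\pm 1/\sqrt{k}$ planted vector, and setting the constant hidden in $\gtrsim$ to 1 are all assumptions made only for illustration.

```python
import numpy as np

def sample_wishart_sparse_pca(n, d, k, beta, rng):
    """Draw n samples from N(0, Id + beta * v v^T) for a k-sparse unit vector v.

    Hypothetical helper: v is taken to have entries +-1/sqrt(k) on a random
    support, which is one possible k-sparse unit vector (an assumption).
    """
    support = rng.choice(d, size=k, replace=False)
    v = np.zeros(d)
    v[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    # Y_i = g_i + sqrt(beta) * u_i * v with g_i ~ N(0, Id) and u_i ~ N(0, 1),
    # so that E[Y_i Y_i^T] = Id + beta * v v^T.
    g = rng.standard_normal((n, d))
    u = rng.standard_normal((n, 1))
    Y = g + np.sqrt(beta) * u * v
    return Y, v

def beta_bound(n, d, k, t=1):
    """Right-hand side k / sqrt(n t) * sqrt(ln(2 + t d / k^2)).

    The constant hidden in the asymptotic notation is set to 1 (an assumption).
    With t = 1 this is the expression stated for Covariance Thresholding.
    """
    return k / np.sqrt(n * t) * np.sqrt(np.log(2 + t * d / k**2))

rng = np.random.default_rng(0)
n, d, k = 2000, 400, 20  # regime k ~ sqrt(d), n >= d
Y, v = sample_wishart_sparse_pca(n, d, k, beta=1.0, rng=rng)
ratio = beta_bound(n, d, k, t=4) / beta_bound(n, d, k, t=1)
print(f"bound at t=4 relative to t=1: {ratio:.3f}")
```

In the regime $k \approx \sqrt{d}$ the logarithmic factor grows only slowly with $t$, so the expression shrinks roughly like $1/\sqrt{t}$; this is the sense in which a larger constant $t$ improves on the $t = 1$ expression at the cost of running time $n\cdot d^{O(t)}$.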