In the Wishart model for sparse PCA we are given $n$ samples $Y_1,\ldots, Y_n$ drawn independently from a $d$-dimensional Gaussian distribution $N(0, \mathrm{Id} + \beta vv^\top)$, where $\beta > 0$ and $v\in \mathbb{R}^d$ is a $k$-sparse unit vector, and we wish to recover $v$ (up to sign). We show that if $n \ge \Omega(d)$, then for every $t \ll k$ there exists an algorithm running in time $n\cdot d^{O(t)}$ that solves this problem as long as \[ \beta \gtrsim \frac{k}{\sqrt{nt}}\sqrt{\ln(2 + td/k^2)}\,. \] Prior to this work, the best polynomial time algorithm in the regime $k\approx \sqrt{d}$, called \emph{Covariance Thresholding} (proposed in [KNV15a] and analyzed in [DM14]), required $\beta \gtrsim \frac{k}{\sqrt{n}}\sqrt{\ln(2 + d/k^2)}$. For a large enough constant $t$, our algorithm runs in polynomial time and has better guarantees than Covariance Thresholding. Previously known algorithms with such guarantees required quasi-polynomial time $d^{O(\log d)}$. In addition, we show that our techniques apply to sparse PCA with adversarial perturbations, studied in [dKNS20]. This model generalizes not only sparse PCA, but also other problems studied in prior works, including the sparse planted vector problem. As a consequence, we provide polynomial time algorithms for the sparse planted vector problem that have better guarantees than the state of the art in some regimes. Our approach also works with the Wigner model for sparse PCA. Moreover, we show that it is possible to combine our techniques with recent results on sparse PCA with symmetric heavy-tailed noise [dNNS22]. In particular, in the regime $k \approx \sqrt{d}$ we get the first polynomial time algorithm that works with symmetric heavy-tailed noise, while the algorithm from [dNNS22] requires quasi-polynomial time in this setting.
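To make the model and the two signal-strength bounds concrete, the following is a minimal Python sketch. It is not the algorithm of this work: it only samples from the Wishart model and evaluates the right-hand side of the bounds above. The function names are hypothetical, the constants hidden by $\gtrsim$ are assumed to be 1, and the choice of $\pm 1/\sqrt{k}$ entries for $v$ is just one illustrative $k$-sparse unit vector.

```python
import math
import random


def sparse_unit_vector(d, k, rng):
    """Return a k-sparse unit vector in R^d.

    Illustrative choice (an assumption, not from the source): entries are
    +-1/sqrt(k) on a uniformly random support of size k.
    """
    v = [0.0] * d
    for i in rng.sample(range(d), k):
        v[i] = rng.choice((-1.0, 1.0)) / math.sqrt(k)
    return v


def sample_wishart(n, v, beta, rng):
    """Draw n samples from N(0, Id + beta * v v^T).

    Uses the representation Y = g + sqrt(beta) * z * v with g ~ N(0, Id)
    and z ~ N(0, 1) independent, whose covariance is Id + beta * v v^T.
    """
    samples = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        samples.append(
            [rng.gauss(0.0, 1.0) + math.sqrt(beta) * z * vi for vi in v]
        )
    return samples


def beta_threshold(n, d, k, t=1):
    """Evaluate (k / sqrt(n t)) * sqrt(ln(2 + t d / k^2)).

    The hidden constant is assumed to be 1. With t = 1 this is the
    Covariance Thresholding bound quoted in the text.
    """
    return k / math.sqrt(n * t) * math.sqrt(math.log(2 + t * d / k**2))


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, k = 10_000, 10_000, 100  # regime k ~ sqrt(d), n ~ d
    for t in (1, 4, 16):
        print(t, beta_threshold(n, d, k, t))
    v = sparse_unit_vector(50, 5, rng)
    Y = sample_wishart(3, v, beta=2.0, rng=rng)
    print(len(Y), len(Y[0]))
```

In the regime $k \approx \sqrt{d}$ the logarithmic factor grows only like $\sqrt{\ln(2+t)}$ while the prefactor shrinks like $1/\sqrt{t}$, so under these assumptions the evaluated bound decreases as $t$ grows, at the cost of running time $n\cdot d^{O(t)}$.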