We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a \emph{corrupted} set of samples from $\mathcal{N}(\mu,\mathbf{I}_d)$, where the unknown mean $\mu \in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/\epsilon)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/\epsilon)$, where $\epsilon$ is the fraction of contamination. In particular, even the fastest existing algorithms run in time $\Omega(d^2)$, which can be prohibitive in high dimensions. This quadratic barrier stems from the reliance of these algorithms on the sample covariance matrix, which has $d^2$ entries. Our main contribution is an algorithm for robust sparse mean estimation that runs in \emph{subquadratic} time using $\mathrm{poly}(k,\log d,1/\epsilon)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a problem that generalizes the light-bulb problem of Valiant.
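For concreteness, the sparsity constraint and the source of the quadratic cost can be written out as follows. This is only a sketch: the symbols $n$ (number of samples), $x_1,\dots,x_n$ (observed points), $\bar{x}$ (their empirical mean), and $\widehat{\Sigma}$ (sample covariance) are notation assumed here and do not appear in the abstract.

```latex
% Assumed notation (not from the abstract): n, x_i, \bar{x}, \widehat{\Sigma}.
\[
  \|\mu\|_0 = \bigl|\{\, j : \mu_j \neq 0 \,\}\bigr| \le k,
  \qquad
  \widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top} \in \mathbb{R}^{d \times d}.
\]
% Writing down \widehat{\Sigma} explicitly requires filling all d^2 entries,
% so any method that forms it takes \Omega(d^2) time regardless of k.
```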