We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample-efficient and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.
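To make the stated tradeoff concrete, the general bounds can be instantiated at specific moment orders. This is only a worked reading of the formulas above, assuming $t$ is treated as a fixed constant so that the $O(t)$ exponent is an unspecified constant; the choices $t = 4$ and $t = 8$ are illustrative.

$$
\begin{aligned}
t = 4:&\quad \text{error } O(\epsilon^{3/4}), \qquad m = (k\log(d))^{O(1)}/\epsilon^{3/2},\\
t = 8:&\quad \text{error } O(\epsilon^{7/8}), \qquad m = (k\log(d))^{O(1)}/\epsilon^{7/4},\\
t \to \infty:&\quad 1 - 1/t \to 1, \qquad 2 - 2/t \to 2.
\end{aligned}
$$

As $t$ grows, the error exponent approaches the $\tilde O(\epsilon)$ rate of the Gaussian case and the dependence on $\epsilon$ in the sample complexity approaches $1/\epsilon^2$, while the dependence on $k\log(d)$ grows as $(k\log(d))^{O(t)}$.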