We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost-linear-time algorithms with optimal error guarantees for both of these problems. Specifically, for Gaussian robust mean estimation on $\mathbb{R}^d$ with contamination parameter $\epsilon \in (0, \epsilon_0)$ for a small absolute constant $\epsilon_0$, we give an algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost-linear runtime that approximates the target mean within $\ell_2$-error $O(\epsilon)$. This improves on prior work that achieved this error guarantee with polynomially suboptimal sample and time complexity. For robust linear regression, we give the first algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost-linear runtime that approximates the target regressor within $\ell_2$-error $O(\epsilon)$. This is the first polynomial sample and time algorithm achieving the optimal error guarantee, answering an open question in the literature. At the technical level, we develop a methodology that yields almost-linear-time algorithms for multi-directional filtering; this methodology may be of broader interest.
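For concreteness, the Huber contamination model for mean estimation is commonly formalized as below. This is a sketch of the standard definition rather than a statement taken from the abstract: the symbols $\mu$, $Q$, and $\hat{\mu}$, as well as the identity-covariance assumption, are notation assumed here for illustration.

$$
X_1, \dots, X_n \;\overset{\text{i.i.d.}}{\sim}\; (1-\epsilon)\, \mathcal{N}(\mu, I_d) \;+\; \epsilon\, Q ,
$$

where $Q$ is an arbitrary, unknown distribution on $\mathbb{R}^d$. In this notation, the mean-estimation guarantee described above corresponds to an estimator $\hat{\mu}$ computed from $n = \tilde{O}(d/\epsilon^2)$ samples that satisfies

$$
\|\hat{\mu} - \mu\|_2 = O(\epsilon) .
$$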