This paper studies robust nonparametric regression, in which an adversarial attacker can modify the values of up to $q$ samples in a training dataset of size $N$. Our initial solution is an M-estimator based on Huber loss minimization. Compared with simple kernel regression, i.e., the Nadaraya-Watson estimator, this method significantly weakens the impact of malicious samples on regression performance. We provide the convergence rate as well as the corresponding minimax lower bound. The results show that, with proper bandwidth selection, the $\ell_\infty$ error is minimax optimal. The $\ell_2$ error is optimal for relatively small $q$ but suboptimal for larger $q$, because this estimator is vulnerable when many attacked samples concentrate in a small region. To address this issue, we propose a correction method that projects the initial estimate onto the space of Lipschitz functions. The final estimate is nearly minimax optimal for arbitrary $q$, up to a $\ln N$ factor.
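To illustrate the contrast between the Nadaraya-Watson estimator and a kernel-weighted Huber M-estimator, the following is a minimal sketch and not the paper's implementation. It assumes a one-dimensional covariate, a box kernel, a Huber threshold `T = 1.0`, an iteratively reweighted least squares (IRLS) solver, and a toy attack that sets the $q$ responses nearest to $x = 0.5$ to a large value. All function names, parameter values, and the test function $\sin(2\pi x)$ are hypothetical choices made for illustration.

```python
import numpy as np


def box_kernel_weights(x, X, h):
    """Box kernel: weight 1 for samples within bandwidth h of x, else 0."""
    return (np.abs(X - x) <= h).astype(float)


def nadaraya_watson(x, X, Y, h):
    """Kernel-weighted average of the responses (assumes a nonempty window)."""
    w = box_kernel_weights(x, X, h)
    return float(np.sum(w * Y) / np.sum(w))


def huber_local(x, X, Y, h, T=1.0, iters=100, tol=1e-10):
    """Minimize sum_i K_i * huber_T(Y_i - s) over s using IRLS.

    Assumes at least one sample falls inside the kernel window.
    """
    w = box_kernel_weights(x, X, h)
    s = float(np.median(Y[w > 0]))  # robust starting point
    for _ in range(iters):
        r = np.abs(Y - s)
        # IRLS weights for the Huber loss: 1 if |r| <= T, otherwise T / |r|
        u = np.where(r <= T, 1.0, T / np.maximum(r, 1e-12))
        s_new = float(np.sum(w * u * Y) / np.sum(w * u))
        if abs(s_new - s) < tol:
            s = s_new
            break
        s = s_new
    return s


# Toy data (hypothetical): f(x) = sin(2*pi*x) with Gaussian noise
rng = np.random.default_rng(0)
N, q, h = 500, 20, 0.1
X = rng.uniform(0.0, 1.0, size=N)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(N)

# Toy attack: overwrite the q responses closest to x = 0.5
attacked = np.argsort(np.abs(X - 0.5))[:q]
Y_attacked = Y.copy()
Y_attacked[attacked] = 50.0

x0 = 0.5  # true value is f(0.5) = 0
nw_est = nadaraya_watson(x0, X, Y_attacked, h)
huber_est = huber_local(x0, X, Y_attacked, h)
print(f"Nadaraya-Watson: {nw_est:.3f}, Huber: {huber_est:.3f}")
```

In this sketch each attacked sample shifts the Nadaraya-Watson average in proportion to its modified value, whereas its pull on the Huber estimate is capped by `T`. Because the attack is concentrated in one kernel window, the Huber estimate still carries a visible bias there, which is the kind of weakness the abstract attributes to the initial estimator.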