This paper studies robust nonparametric regression, in which an adversarial attacker can modify the values of up to $q$ samples from a training dataset of size $N$. Our initial solution is an M-estimator based on Huber loss minimization. Compared with simple kernel regression, i.e., the Nadaraya-Watson estimator, this method significantly weakens the impact of malicious samples on regression performance. We provide its convergence rate as well as the corresponding minimax lower bound. The results show that, with proper bandwidth selection, the $\ell_\infty$ error is minimax optimal. The $\ell_2$ error is optimal if $q\lesssim \sqrt{N/\ln^2 N}$, but suboptimal for larger $q$. The reason is that this estimator is vulnerable when many attacked samples are concentrated in a small region. To address this issue, we propose a correction method that projects the initial estimate onto the space of Lipschitz functions. The final estimate is nearly minimax optimal for arbitrary $q$, up to a $\ln N$ factor.
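The contrast between the Nadaraya-Watson estimator and a Huber-loss M-estimator can be illustrated with a small simulation. The following is a minimal sketch, not the paper's implementation: it assumes a local-constant estimator that minimizes the kernel-weighted Huber loss at a single query point, solved by iteratively reweighted averaging. The function names (`box_kernel`, `nadaraya_watson`, `huber_kernel_estimate`), the regression function, the bandwidth `h`, the Huber threshold `T`, the sizes `N` and `q`, and the attack pattern (the `q` samples nearest to the query point are overwritten with a large value) are all illustrative assumptions. The Lipschitz projection step is not shown.

```python
import numpy as np


def box_kernel(u):
    """Uniform kernel on [-1, 1] (an illustrative choice)."""
    return (np.abs(u) <= 1.0).astype(float)


def nadaraya_watson(x, X, Y, h):
    """Kernel-weighted average of Y around x (assumes the window is non-empty)."""
    w = box_kernel((x - X) / h)
    return np.sum(w * Y) / np.sum(w)


def huber_kernel_estimate(x, X, Y, h, T, n_iter=100, tol=1e-10):
    """Minimize sum_i K((x - X_i)/h) * huber_T(Y_i - s) over the scalar s.

    Solved by iteratively reweighted averaging: residuals larger than T
    in absolute value are down-weighted by T / |residual|.
    """
    w = box_kernel((x - X) / h)
    s = np.median(Y[w > 0])  # robust starting point
    for _ in range(n_iter):
        r = np.abs(Y - s)
        omega = np.minimum(1.0, T / np.maximum(r, 1e-12))
        s_new = np.sum(w * omega * Y) / np.sum(w * omega)
        if abs(s_new - s) < tol:
            s = s_new
            break
        s = s_new
    return s


def true_f(x):
    """Hypothetical regression function used only for this demo."""
    return np.sin(2 * np.pi * x)


rng = np.random.default_rng(0)
N, q = 2000, 40           # sample size and number of attacked samples
h, T = 0.1, 1.0           # bandwidth and Huber threshold (hand-picked)
x0 = 0.5                  # query point

X = rng.uniform(0.0, 1.0, N)
Y = true_f(X) + 0.1 * rng.standard_normal(N)

# Attack: overwrite the q responses closest to x0 with a large value.
attacked = np.argsort(np.abs(X - x0))[:q]
Y_attacked = Y.copy()
Y_attacked[attacked] = 100.0

nw_err = abs(nadaraya_watson(x0, X, Y_attacked, h) - true_f(x0))
huber_err = abs(huber_kernel_estimate(x0, X, Y_attacked, h, T) - true_f(x0))
print(f"Nadaraya-Watson error: {nw_err:.3f}")
print(f"Huber M-estimator error: {huber_err:.3f}")
```

In this setup each attacked sample shifts the Huber estimate by an amount bounded by `T`, whereas its effect on the kernel average grows with the size of the modification. The attack is also deliberately concentrated near `x0`, which is the regime the abstract identifies as the weak point of the initial estimator: the Huber error stays bounded but does not vanish.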