Recent empirical and theoretical studies have established the generalization capabilities of large machine learning models that are trained to (approximately or exactly) fit noisy data. In this work, we prove a surprising result: even if the ground truth itself is robust to adversarial examples, and the benignly overfitted model is benign in terms of the ``standard'' out-of-sample risk objective, this benign overfitting process can be harmful when out-of-sample data are subject to adversarial manipulation. More specifically, our main results consist of two parts: (i) the min-norm estimator in overparameterized linear models always leads to adversarial vulnerability in the ``benign overfitting'' setting; (ii) we establish an asymptotic trade-off between the standard risk and the ``adversarial'' risk of every ridge regression estimator, implying that under suitable conditions these two risks cannot both be made small by any single choice of the ridge regularization parameter. Furthermore, under the lazy training regime, we demonstrate parallel results for the two-layer neural tangent kernel (NTK) model, which align with empirical observations in deep neural networks. Our findings provide theoretical insight into a puzzling phenomenon observed in practice, where the true target function (e.g., a human) is robust against adversarial attacks, while benignly overfitted neural networks lead to models that are not robust.
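To make the two quantities in (i) and (ii) concrete, the following is a minimal numerical sketch rather than the construction analyzed in this work. It assumes an $\ell_2$-bounded perturbation of radius `delta` applied to the test input while the clean target $\langle x, \theta^* \rangle$ is held fixed, in which case the worst-case squared error of a linear predictor has the closed form $(|\langle x, \hat\theta - \theta^* \rangle| + \delta \|\hat\theta\|_2)^2$. The function names, the spiked covariance spectrum, and all numerical values are illustrative assumptions.

```python
import numpy as np


def ridge_estimator(X, y, lam):
    """Ridge estimator in dual form; lam = 0 returns the min-norm interpolator."""
    n = X.shape[0]
    if lam == 0.0:
        # The pseudoinverse gives the minimum l2-norm solution of X theta = y.
        return np.linalg.pinv(X) @ y
    # X^T (X X^T + lam I)^{-1} y, equal to (X^T X + lam I)^{-1} X^T y.
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)


def standard_risk(theta_hat, theta_star, X_test):
    """Monte Carlo estimate of E <x, theta_hat - theta_star>^2."""
    return np.mean((X_test @ (theta_hat - theta_star)) ** 2)


def adversarial_risk(theta_hat, theta_star, X_test, delta):
    """Assumed definition: worst case over ||d||_2 <= delta of
    (<x + d, theta_hat> - <x, theta_star>)^2, evaluated in closed form."""
    clean_err = np.abs(X_test @ (theta_hat - theta_star))
    return np.mean((clean_err + delta * np.linalg.norm(theta_hat)) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, sigma, delta = 100, 1000, 0.5, 1.0  # illustrative values only

    # Hypothetical spectrum: a few strong directions and a long, small tail.
    eigs = np.concatenate([np.ones(5), np.full(d - 5, 0.01)])
    theta_star = np.zeros(d)
    theta_star[:5] = 1.0

    X = rng.standard_normal((n, d)) * np.sqrt(eigs)
    y = X @ theta_star + sigma * rng.standard_normal(n)
    X_test = rng.standard_normal((2000, d)) * np.sqrt(eigs)

    for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
        theta_hat = ridge_estimator(X, y, lam)
        print(
            f"lam={lam:7.1f}  "
            f"standard={standard_risk(theta_hat, theta_star, X_test):.4f}  "
            f"adversarial={adversarial_risk(theta_hat, theta_star, X_test, delta):.4f}"
        )
```

Sweeping `lam` in this way lets one compare the two risks across the ridge path, including the min-norm interpolator at `lam = 0`. Under the assumed definition the adversarial risk exceeds the standard risk by a term that grows with $\|\hat\theta\|_2$, so the norm of the fitted estimator is the quantity to watch. A finite-sample simulation of this kind is only suggestive and does not substitute for the asymptotic statements above.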