Recent empirical and theoretical studies have established the generalization capabilities of large machine learning models that are trained to (approximately or exactly) fit noisy data. In this work, we prove a surprising result: even if the ground truth itself is robust to adversarial examples, and the benignly overfitted model is benign in terms of the ``standard'' out-of-sample risk objective, this benign overfitting process can be harmful when out-of-sample data are subject to adversarial manipulation. More specifically, our main results contain two parts: (i) the min-norm estimator in overparameterized linear models always leads to adversarial vulnerability in the ``benign overfitting'' setting; (ii) we establish an asymptotic trade-off between the standard risk and the ``adversarial'' risk of every ridge regression estimator, implying that under suitable conditions these two quantities cannot both be made small by any single choice of the ridge regularization parameter. Furthermore, under the lazy training regime, we demonstrate parallel results for the two-layer neural tangent kernel (NTK) model, which align with empirical observations in deep neural networks. Our findings provide theoretical insight into a puzzling phenomenon observed in practice: the true target function (e.g., a human) is robust against adversarial attacks, while benignly overfitted neural networks yield models that are not robust.
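To make the two estimators named in the abstract concrete, the following is a minimal NumPy sketch rather than the construction used in the paper. Several things are assumptions made only for illustration: the function names, the Gaussian design with a diagonal covariance `cov_diag`, the ridge scaling `n * lam`, and the definition of adversarial risk as the expected worst-case squared error under an l2-bounded perturbation of radius `eps` against the noiseless target. Under these assumptions the inner maximization has a closed form, since the worst-case perturbation adds `eps * ||theta_hat||` to the absolute prediction error.

```python
import numpy as np


def min_norm_estimator(X, y):
    """Minimum l2-norm solution of X @ theta = y (interpolates when n < d)."""
    return np.linalg.pinv(X) @ y


def ridge_estimator(X, y, lam):
    """Ridge estimator in dual form: X^T (X X^T + n*lam*I)^{-1} y.

    The n*lam scaling is an illustrative convention, not taken from the source.
    """
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)


def standard_risk(theta_hat, theta_star, cov_diag):
    """Excess risk E[(x^T (theta_hat - theta_star))^2] for x ~ N(0, diag(cov_diag))."""
    diff = theta_hat - theta_star
    return float(np.sum(cov_diag * diff ** 2))


def adversarial_risk(theta_hat, theta_star, cov_diag, eps):
    """Assumed definition: E max_{||delta||_2 <= eps} (<x + delta, theta_hat> - <x, theta_star>)^2.

    The maximum equals (|a| + eps * ||theta_hat||)^2 with a = <x, theta_hat - theta_star>.
    For Gaussian x, a ~ N(0, s^2), so the expectation is s^2 + 2*c*s*sqrt(2/pi) + c^2,
    where c = eps * ||theta_hat||_2.
    """
    s2 = standard_risk(theta_hat, theta_star, cov_diag)
    c = eps * np.linalg.norm(theta_hat)
    return float(s2 + 2.0 * c * np.sqrt(s2) * np.sqrt(2.0 / np.pi) + c ** 2)


# Illustrative overparameterized problem (n < d) with a hypothetical spiked spectrum.
rng = np.random.default_rng(0)
n, d, noise_std, eps = 50, 500, 0.5, 1.0
cov_diag = np.concatenate([np.ones(5), 0.01 * np.ones(d - 5)])
theta_star = np.zeros(d)
theta_star[:5] = 1.0

X = rng.standard_normal((n, d)) * np.sqrt(cov_diag)
y = X @ theta_star + noise_std * rng.standard_normal(n)

theta_mn = min_norm_estimator(X, y)
print("min-norm:",
      standard_risk(theta_mn, theta_star, cov_diag),
      adversarial_risk(theta_mn, theta_star, cov_diag, eps))

for lam in [1e-4, 1e-2, 1.0]:
    theta_r = ridge_estimator(X, y, lam)
    print(f"ridge lam={lam}:",
          standard_risk(theta_r, theta_star, cov_diag),
          adversarial_risk(theta_r, theta_star, cov_diag, eps))
```

The sketch only exposes the mechanism the abstract alludes to: the adversarial risk depends on the norm of the fitted parameter in addition to the standard risk, so an interpolating solution that absorbs label noise into a large-norm parameter can pay a penalty that the standard risk does not see. It does not reproduce the paper's asymptotic regime or its conditions.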