Machine-learning models require periodic updates to improve their average accuracy, exploiting novel architectures and additional data. However, a newly updated model may make mistakes that the previous model did not make. Such misclassifications are referred to as negative flips, and are experienced by users as a regression in performance. In this work, we show that this problem also affects robustness to adversarial examples, thereby hindering the development of secure model update practices. In particular, when a model is updated to improve its adversarial robustness, some adversarial examples that were previously ineffective may become misclassified, causing a regression in the perceived security of the system. We propose a novel technique, named robustness-congruent adversarial training, to address this issue. It amounts to fine-tuning a model with adversarial training, while constraining it to retain higher robustness on the adversarial examples that were correctly classified before the update. We show that our algorithm and, more generally, learning with non-regression constraints, provide a theoretically grounded framework to train consistent estimators. Our experiments on robust models for computer vision confirm that (i) both accuracy and robustness, even if improved after a model update, can be affected by negative flips, and (ii) our robustness-congruent adversarial training can mitigate the problem, outperforming competing baseline methods.
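To make the notion of a negative flip concrete, the following minimal sketch counts the samples that the old model handles correctly and the updated model does not. The function name `negative_flip_rate` and the per-sample outcome arrays are illustrative assumptions, not taken from the paper; the same computation applies to clean accuracy or to robustness, depending on whether the outcomes are measured on clean inputs or on adversarial examples.

```python
import numpy as np


def negative_flip_rate(old_correct, new_correct):
    """Fraction of samples the old model got right and the new model gets wrong.

    Both arguments are per-sample boolean outcomes (hypothetical inputs):
    True means the model classified that sample correctly.
    """
    old_correct = np.asarray(old_correct, dtype=bool)
    new_correct = np.asarray(new_correct, dtype=bool)
    if old_correct.shape != new_correct.shape:
        raise ValueError("both arrays must have the same shape")
    if old_correct.size == 0:
        raise ValueError("at least one sample is required")
    # A negative flip: correct before the update, incorrect after it.
    flips = old_correct & ~new_correct
    return float(flips.mean())


# Made-up outcomes on 8 adversarial examples (1 = still classified correctly).
old_robust = [1, 1, 0, 1, 0, 1, 0, 0]  # old model: 4/8 robust
new_robust = [1, 0, 1, 1, 1, 1, 1, 0]  # new model: 6/8 robust

print(negative_flip_rate(old_robust, new_robust))  # 0.125
```

In this made-up example the updated model is more robust on average (6 of 8 versus 4 of 8), yet one adversarial example that the old model withstood now succeeds, which is the kind of regression the abstract describes.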