While prior research has proposed a plethora of methods that enhance the adversarial robustness of neural classifiers, practitioners remain reluctant to adopt these techniques because of their unacceptably severe penalties in clean accuracy. This paper shows that this accuracy-robustness trade-off can be significantly alleviated by mixing the output probabilities of a standard classifier, which is optimized for clean accuracy and is generally not robust, with those of a robust model. We show that the key ingredient of this improvement is the difference in the robust base classifier's confidence between correctly and incorrectly classified examples. In addition to providing intuitive and empirical evidence, we theoretically certify the robustness of the mixed classifier under realistic assumptions. Furthermore, we adapt an adversarial input detector into a mixing network that adaptively adjusts the mixture of the two base models, further reducing the accuracy penalty of achieving robustness. The proposed method, termed "adaptive smoothing", is flexible: it can work in conjunction with existing or even future methods that improve clean accuracy, robustness, or adversary detection. Our empirical evaluation considers strong attack methods, including AutoAttack and adaptive attacks. On the CIFAR-100 dataset, our method achieves 85.21% clean accuracy while maintaining 38.72% accuracy under $\ell_\infty$ AutoAttack ($\epsilon = 8/255$). As of submission, this makes it the second most robust method on the RobustBench CIFAR-100 benchmark, while its clean accuracy is ten percentage points higher than that of every listed model. The code that implements our method is available at https://github.com/Bai-YT/AdaptiveSmoothing.
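The probability-mixing idea can be illustrated with a short sketch. This is a minimal, hypothetical illustration, not the authors' implementation (that is in the linked repository). It assumes that the mixed output is the convex combination (1 - alpha) * p_std + alpha * p_rob of the two base models' softmax probabilities, with a weight alpha in [0, 1]. In the adaptive variant, `mixing_net` is a hypothetical callable standing in for the adapted detector, which outputs alpha per input. All function names here are illustrative.

```python
import math
from typing import Callable, Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_probabilities(std_logits: Sequence[float],
                      rob_logits: Sequence[float],
                      alpha: float) -> list[float]:
    """Convex combination of the two base models' class probabilities.

    alpha = 0 uses only the standard (accurate) model;
    alpha = 1 uses only the robust model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p_std = softmax(std_logits)
    p_rob = softmax(rob_logits)
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(p_std, p_rob)]


def adaptive_mix(x,
                 std_model: Callable,
                 rob_model: Callable,
                 mixing_net: Callable) -> list[float]:
    """Adaptive variant: a hypothetical mixing network picks alpha per input."""
    alpha = mixing_net(x)
    return mix_probabilities(std_model(x), rob_model(x), alpha)


def predict(probs: Sequence[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    # Toy logits: the standard model favors class 1, while the robust model
    # is very confident about class 0.
    std_logits = [0.0, 2.0, 0.0]
    rob_logits = [3.0, 0.0, 0.0]
    for a in (0.0, 0.5, 1.0):
        print(a, predict(mix_probabilities(std_logits, rob_logits, a)))
```

In this toy case, alpha = 0 follows the standard model (class 1), while alpha = 0.5 already follows the robust model (class 0). This is because the robust model's probability on its top class (about 0.91) outweighs the standard model's (about 0.79). It loosely mirrors the abstract's point that the robust model's confidence behavior drives the mixed prediction.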