Adversarial robustness remains a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off between robustness against different $L_p$ norms, we show that there is room for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on attack families and attack tunings encountered only at test time. On ensembles of attacks, our approach improves accuracy from 3.4% (the best existing baseline) to 25.9% on MNIST, and from 16.9% to 23.5% on CIFAR10.
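To make "treat each type of attack as a domain" concrete, here is a minimal sketch that assumes the variance-based REx variant (V-REx), whose objective is the sum of per-domain risks plus $\beta$ times their variance. The abstract does not state which REx variant or weighting is used, so the function name `vrex_objective`, the attack names, and the plain-float risks are illustrative assumptions, not the authors' implementation. In real training the risks would be differentiable losses computed on adversarially perturbed batches.

```python
from statistics import pvariance
from typing import Mapping


def vrex_objective(attack_risks: Mapping[str, float], beta: float) -> float:
    """Hypothetical V-REx-style objective over per-attack adversarial risks.

    Each key names an attack type treated as a "domain" (names are illustrative);
    each value is the model's average loss on inputs perturbed by that attack.
    """
    risks = list(attack_risks.values())
    if not risks:
        raise ValueError("need at least one attack domain")
    # Sum of per-attack risks: the usual multi-attack adversarial training loss.
    total = sum(risks)
    # Population variance across attacks: penalises uneven robustness, pushing
    # the model towards similar loss levels under every training attack.
    penalty = pvariance(risks)
    return total + beta * penalty


# Illustrative risks for three hypothetical attack domains.
risks = {"linf_pgd": 0.5, "l2_pgd": 0.3, "l1_attack": 1.0}
print(vrex_objective(risks, beta=0.0))   # plain sum of risks
print(vrex_objective(risks, beta=10.0))  # sum plus weighted variance penalty
```

With `beta=0` the objective reduces to summing the per-attack losses; increasing `beta` trades some average-case loss for more uniform robustness across the training attacks, which is the property the abstract attributes to REx.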