Adversarial training attains strong empirical robustness to specific adversarial attacks by training on concrete adversarial perturbations, but it produces neural networks that are not amenable to strong robustness certificates through neural network verification. On the other hand, earlier certified training schemes directly train on bounds from network relaxations to obtain certifiably robust models, but these display sub-par standard performance. Recent work has shown that state-of-the-art trade-offs between certified robustness and standard performance can be obtained through a family of losses combining adversarial outputs and neural network bounds. Nevertheless, unlike empirical robustness, verifiability still comes at a significant cost in standard performance. In this work, we propose to leverage empirically-robust teachers to improve the performance of certifiably-robust models through knowledge distillation. Using a versatile feature-space distillation objective, we show that distillation from adversarially-trained teachers consistently improves on the state of the art in certified training for ReLU networks across a series of robust computer vision benchmarks.
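For intuition only, the sketch below illustrates one way a feature-space distillation term from an adversarially-trained teacher could be added on top of a certified training loss. It is not the paper's actual objective: the function names, the layer-matching assumption, and the fixed weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(student_feats, teacher_feats):
    """Hypothetical feature-space distillation term: mean squared distance
    between intermediate features of the student and of a frozen,
    adversarially-trained teacher (assumed to have matching shapes)."""
    terms = [F.mse_loss(fs, ft.detach()) for fs, ft in zip(student_feats, teacher_feats)]
    return sum(terms) / len(terms)

def total_loss(certified_loss, student_feats, teacher_feats, distill_weight=1.0):
    """Combine a certified-training loss (e.g., one mixing adversarial outputs
    and network bounds) with the distillation term; the weighting is an
    assumption made for this sketch."""
    return certified_loss + distill_weight * feature_distillation_loss(student_feats, teacher_feats)
```

In such a setup, the teacher is trained with standard adversarial training and kept fixed, while the student is trained with the combined objective so that its intermediate representations are pulled toward the empirically-robust teacher's while its logits remain certifiable through the bound-based loss.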