This work concerns the development of deep networks that are certifiably robust to adversarial attacks. Joint robust classification-detection was recently introduced as a certified defense mechanism, in which adversarial examples are either correctly classified or assigned to an "abstain" class. In this work, we show that this provable framework benefits from an extension to networks with multiple explicit abstain classes, to which adversarial examples are adaptively assigned. We show that naively adding multiple abstain classes can lead to "model degeneracy", and we propose a regularization approach and a training method that counter this degeneracy by promoting full use of the multiple abstain classes. Our experiments demonstrate that the proposed approach consistently achieves favorable tradeoffs between standard and verified robust accuracy, outperforming state-of-the-art algorithms for various numbers of abstain classes.
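The decision rule implied by multiple abstain classes can be illustrated with a minimal sketch. It assumes a network whose output has K real-class logits followed by M abstain-class logits; this layout, the `ABSTAIN` sentinel, and all function names are hypothetical choices made for illustration and are not taken from the paper. The sketch shows only the inference-time rule and a simple usage count. It does not implement certification or the proposed regularizer.

```python
import numpy as np

ABSTAIN = -1  # hypothetical sentinel meaning "some abstain class won"


def predict_with_abstain(logits, num_classes):
    """Return the winning real class, or ABSTAIN if an abstain class wins.

    Assumes (for this sketch) that the first `num_classes` logits are real
    classes and all remaining logits are abstain classes.
    """
    logits = np.asarray(logits, dtype=float)
    winner = int(np.argmax(logits))
    return winner if winner < num_classes else ABSTAIN


def is_acceptable(logits, true_label, num_classes):
    """An input counts as handled if it is classified correctly or is
    routed to any of the abstain classes."""
    decision = predict_with_abstain(logits, num_classes)
    return decision == true_label or decision == ABSTAIN


def abstain_usage(logits_batch, num_classes):
    """Count how often each abstain class wins over a batch.

    Illustrative diagnostic only: counts concentrated on one abstain class
    would suggest the others go unused.
    """
    logits_batch = np.asarray(logits_batch, dtype=float)
    winners = np.argmax(logits_batch, axis=1)
    num_abstain = logits_batch.shape[1] - num_classes
    abstain_winners = winners[winners >= num_classes] - num_classes
    return np.bincount(abstain_winners, minlength=num_abstain)


if __name__ == "__main__":
    # Two real classes (K = 2) and two abstain classes (M = 2).
    batch = [
        [0.1, 0.2, 3.0, 0.0],  # first abstain class wins
        [2.0, 0.1, 0.0, 0.0],  # real class 0 wins
        [0.0, 0.0, 0.0, 5.0],  # second abstain class wins
    ]
    print([predict_with_abstain(row, 2) for row in batch])
    print(abstain_usage(batch, 2))
```

In this sketch, abstaining on a clean input would still count against standard accuracy, which is one way to read the standard versus verified robust accuracy tradeoff mentioned above.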