The vulnerability of neural network classifiers to adversarial attacks is a major obstacle to their deployment in safety-critical applications. Regularizing network parameters during training can improve adversarial robustness and generalization performance. Usually, the network is regularized end-to-end, so that parameters at all layers are affected. However, in settings where learning representations is key, such as self-supervised learning (SSL), the layers after the feature representation are discarded at inference time. For these models, regularizing only up to the feature space is more suitable. To this end, we propose a new spectral regularizer for representation learning that encourages black-box adversarial robustness in downstream classification tasks. In supervised classification settings, we show empirically that this method is more effective at boosting test accuracy and robustness than previously proposed methods that regularize all layers of the network. We then show that this method improves the adversarial robustness of classifiers that use representations learned with self-supervised training or transferred from another classification task. Overall, our work begins to unveil how representational structure affects adversarial robustness.