We present a simple yet effective method to improve the robustness of Convolutional Neural Networks (CNNs) against adversarial examples by post-processing an adversarially trained model. Our technique, MeanSparse, cascades the activation functions of a trained model with novel operators that sparsify mean-centered feature vectors. This is equivalent to reducing feature variations around the mean, and we show that such reduced variations minimally affect the model's utility, yet they strongly attenuate the adversarial perturbations and decrease the attacker's success rate. Our experiments show that, when applied to the top models in the RobustBench leaderboard, MeanSparse achieves a new robustness record of 72.08% (from 71.07%) and 59.64% (from 59.56%) on CIFAR-10 and ImageNet, respectively, in terms of AutoAttack accuracy. Code is available at https://github.com/SPIN-UMass/MeanSparse
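To make the core idea concrete, the following is a minimal NumPy sketch of a mean-centered sparsification operator: activations whose deviation from a pre-computed feature mean falls below a threshold are snapped back to the mean, suppressing small (potentially adversarial) variations while leaving large activations intact. The function name, scalar arguments, and thresholding rule here are illustrative assumptions, not the paper's exact implementation; in practice the mean and threshold would be estimated per channel from training statistics.

```python
import numpy as np

def mean_sparse(x, mean, threshold):
    """Illustrative mean-centered sparsification (assumed form, not the
    official MeanSparse API). Deviations from `mean` with magnitude at most
    `threshold` are zeroed, i.e. the activation is replaced by the mean;
    larger deviations pass through unchanged."""
    centered = x - mean                      # mean-centered feature vector
    keep = np.abs(centered) > threshold      # mask of large deviations
    return mean + centered * keep            # small deviations collapse to mean

# Example: with mean 0.0 and threshold 0.5, a small activation 0.1 is
# suppressed to 0.0 while a large activation 2.0 is preserved.
out = mean_sparse(np.array([0.1, 2.0]), 0.0, 0.5)
print(out)  # → [0. 2.]
```

Such an operator would be inserted after each activation of an already-trained network, so no retraining is required.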