Adversarial examples are carefully crafted inputs designed to fool machine learning classifiers. In recent years, adversarial machine learning has been studied extensively, in particular perturbation-based adversarial examples, in which a perturbation imperceptible to humans is added to an image. Adversarial training can be used to achieve robustness against such inputs. Another type of adversarial example is the invariance-based adversarial example, in which an image is semantically modified so that the class predicted by the model does not change but the class determined by humans does. How to ensure robustness against this type of adversarial example has not yet been explored. This work examines the impact of adversarial training with invariance-based adversarial examples on a convolutional neural network (CNN). We show that when adversarial training uses both invariance-based and perturbation-based adversarial examples, the two should be applied simultaneously rather than consecutively. This procedure achieves relatively high robustness against both types of adversarial examples. Additionally, we find that the algorithm used in prior work to generate invariance-based adversarial examples does not correctly determine their labels; we therefore use human-determined labels.
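The difference between consecutive and simultaneous adversarial training comes down to how the two kinds of adversarial examples are scheduled into training batches. The abstract does not say how the simultaneous variant is implemented. The sketch below shows one possible interpretation, in which every batch is stratified so that it contains both example types. The function names `consecutive_batches` and `simultaneous_batches` and the half-and-half split are hypothetical. Generating the examples and updating the CNN are left out, and the examples are assumed to be pre-generated and already carrying human-determined labels.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def consecutive_batches(pert: Sequence[T], inv: Sequence[T], batch_size: int) -> List[List[T]]:
    """Hypothetical consecutive schedule: every perturbation-based batch comes
    first, then every invariance-based batch (two separate training phases)."""
    batches: List[List[T]] = []
    for pool in (pert, inv):
        for start in range(0, len(pool), batch_size):
            batches.append(list(pool[start:start + batch_size]))
    return batches


def simultaneous_batches(pert: Sequence[T], inv: Sequence[T], batch_size: int) -> List[List[T]]:
    """Hypothetical simultaneous schedule (assumption: a stratified split):
    each batch holds batch_size // 2 perturbation-based examples and the rest
    invariance-based ones. Leftovers that cannot fill a whole batch are dropped."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to hold both example types")
    n_pert = batch_size // 2
    n_inv = batch_size - n_pert
    steps = min(len(pert) // n_pert, len(inv) // n_inv)
    return [
        list(pert[s * n_pert:(s + 1) * n_pert]) + list(inv[s * n_inv:(s + 1) * n_inv])
        for s in range(steps)
    ]


# Usage with placeholder (type, index) tuples standing in for labelled images.
pert_examples = [("pert", i) for i in range(8)]
inv_examples = [("inv", i) for i in range(8)]
cons = consecutive_batches(pert_examples, inv_examples, batch_size=4)
sim = simultaneous_batches(pert_examples, inv_examples, batch_size=4)
```

In the consecutive schedule, any single batch holds only one example type, so the model sees the second type only after it has finished training on the first. In this stratified reading of the simultaneous schedule, every gradient step sees both types. This matches the intuition behind training on them together rather than one after the other.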