Convolutional neural networks (CNNs) are fragile to small perturbations in their input images, which leaves them prone to malicious attacks that perturb the inputs to force a misclassification. Such slightly manipulated images, crafted to deceive the classifier, are known as adversarial images. In this work, we investigate statistical differences between natural images and adversarial ones. More precisely, we show that, under a suitable image transformation and for a class of adversarial attacks, the distribution of the leading digit of the pixel values in adversarial images deviates from Benford's law. The stronger the attack, the farther the resulting distribution is from Benford's law. Our analysis provides a detailed investigation of this new approach, which can serve as a basis for alternative adversarial example detection methods that neither modify the original CNN classifier nor operate on the raw high-dimensional pixels as features to defend against attacks.
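For readers unfamiliar with Benford's law, it states that the leading digit d in {1, ..., 9} occurs with probability log10(1 + 1/d). The following is a minimal sketch of how a deviation from that law could be measured, not the procedure used in this work. The helper names are hypothetical, the choice of Kullback-Leibler divergence as the distance is an assumption, and the image transformation is left out because the abstract does not specify it; the functions accept any iterable of already-transformed values.

```python
import math
from collections import Counter

# Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant digit of a nonzero finite number."""
    # repr() of a float lists the significant digits before any exponent,
    # so the first character in '1'..'9' is the leading digit.
    for ch in repr(float(abs(x))):
        if ch in "123456789":
            return int(ch)
    raise ValueError("value has no leading digit (zero, inf, or nan)")


def leading_digit_distribution(values):
    """Empirical frequency of each leading digit 1..9, skipping zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("no nonzero values supplied")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def benford_divergence(values):
    """KL divergence from the empirical distribution to Benford's law.

    Assumption: KL divergence is used here only for illustration; other
    distances (chi-square, Kolmogorov-Smirnov) would serve equally well.
    """
    p = leading_digit_distribution(values)
    return sum(
        p[d] * math.log(p[d] / BENFORD[d]) for d in range(1, 10) if p[d] > 0
    )


# Powers of two are a classic sequence that closely follows Benford's law,
# whereas the integers 1..999 have uniformly distributed leading digits.
benford_like = [2 ** k for k in range(1, 1001)]
uniform_like = list(range(1, 1000))
print(benford_divergence(benford_like), benford_divergence(uniform_like))
```

In a detection setting of the kind described above, `values` would be the transformed pixel values of a single image, and a larger divergence would indicate a stronger departure from Benford's law.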