Artificial neural networks can be fooled by carefully perturbed inputs that cause egregious misclassifications. These \textit{adversarial} attacks have been the focus of extensive research, as have methods to detect and defend against them. We introduce a novel approach to detecting and interpreting adversarial attacks from a graph perspective. For an input image, we compute an associated sparse graph using the layer-wise relevance propagation algorithm \cite{bach15}; specifically, we keep only the edges of the neural network with the highest relevance values. Three quantities are then computed from the graph and compared against those computed from the training set. The result of the comparison is a classification of the image as benign or adversarial. To make the comparison, we introduce two classification methods: 1) an explicit formula based on the Wasserstein distance applied to the node degrees, and 2) logistic regression. Both methods produce strong results, which leads us to believe that a graph-based interpretation of adversarial attacks is valuable.
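The graph construction and degree comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-layer relevance scores are already available as plain matrices (the relevance propagation step itself is not shown), it assumes a single global top-k cut across all layers, and the names `top_edges`, `node_degrees`, and `wasserstein_1d` are hypothetical. The abstract does not name the three graph quantities or give the explicit decision formula, so only the node degree and a one-dimensional Wasserstein distance between two degree sequences are shown.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Edge = Tuple[int, int, int]  # (layer index, source unit, target unit)


def top_edges(relevance: List[Matrix], keep: int) -> List[Edge]:
    """Return the `keep` edges with the highest relevance values.

    relevance[l][i][j] is the (assumed precomputed) relevance of the edge
    from unit i in layer l to unit j in layer l + 1.
    """
    scored = [
        (value, layer, i, j)
        for layer, matrix in enumerate(relevance)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
    ]
    # Assumption: one global ranking over all layers, not a per-layer cut.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(layer, i, j) for _, layer, i, j in scored[:keep]]


def node_degrees(relevance: List[Matrix], keep: int) -> List[int]:
    """Degree of every unit in the sparse graph, ordered by (layer, unit)."""
    layer_sizes = [len(relevance[0])] + [len(matrix[0]) for matrix in relevance]
    degree: Dict[Tuple[int, int], int] = {
        (layer, unit): 0
        for layer, size in enumerate(layer_sizes)
        for unit in range(size)
    }
    for layer, i, j in top_edges(relevance, keep):
        degree[(layer, i)] += 1
        degree[(layer + 1, j)] += 1
    return [degree[node] for node in sorted(degree)]


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """1-Wasserstein distance between two equal-size empirical samples.

    For samples of equal size this equals the mean absolute difference
    between the sorted values.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


if __name__ == "__main__":
    # Toy network with layer sizes 2 -> 2 -> 1 and made-up relevance values.
    toy = [[[0.9, 0.1], [0.2, 0.8]], [[0.7], [0.05]]]
    degrees = node_degrees(toy, keep=3)
    reference = [0, 0, 0, 0, 0]  # placeholder reference degrees, not real data
    print(degrees, wasserstein_1d(degrees, reference))
```

In practice the reference degree sequence would come from training images, and the distance would feed a decision rule such as the explicit formula or logistic regression mentioned above; neither is reproduced here because the abstract does not specify them.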