Reverse engineering adversarial attacks with fingerprints from adversarial examples

Despite intense research efforts, deep neural networks remain vulnerable to adversarial examples: inputs that force the network to confidently produce incorrect outputs. Adversarial examples are typically generated by an attack algorithm that optimizes a perturbation added to a benign input, and many such algorithms have been developed. If attack algorithms could be reverse engineered from adversarial examples, the possibility of attribution could deter bad actors. Here we formulate reverse engineering as a supervised learning problem in which the goal is to assign an adversarial example to a class that represents the algorithm and parameters used. To our knowledge, it has not previously been shown whether this is even possible. We first test whether we can classify the perturbations added to images by attacks on undefended single-label image classification models. Taking a "fight fire with fire" approach, we leverage the sensitivity of deep neural networks to adversarial examples, training them to classify these perturbations. On a 17-class dataset (5 attacks, 4 of them bounded and run with 4 epsilon values each, so 4 × 4 + 1 = 17 classes), we achieve an accuracy of 99.4% with a ResNet50 model trained on the perturbations. We then ask whether we can perform this task without access to the perturbations, instead obtaining an estimate of them with signal processing algorithms, an approach we call "fingerprinting". We find that the JPEG algorithm serves as a simple yet effective fingerprinter (85.05% accuracy), providing a strong baseline for future work. We discuss how our approach can be extended to attack-agnostic, learnable fingerprints and to open-world scenarios with unknown attacks.
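As a rough illustration of the fingerprinting idea, the sketch below treats the residual between an image and its JPEG-compressed copy as an estimate of the perturbation. The abstract does not specify how the JPEG fingerprint is computed, so the residual formulation, the function name `jpeg_fingerprint`, and the default `quality` value are assumptions made for illustration only. The sketch relies on NumPy and Pillow.

```python
import io

import numpy as np
from PIL import Image


def jpeg_fingerprint(image: np.ndarray, quality: int = 75) -> np.ndarray:
    """Return the residual between `image` and its JPEG-compressed copy.

    `image` is an H x W x 3 uint8 RGB array. The residual is a signed
    int16 array of the same shape. Using it as a perturbation estimate
    is an assumption of this sketch, not a detail taken from the abstract.
    """
    # Round-trip the image through an in-memory JPEG encode/decode.
    buffer = io.BytesIO()
    Image.fromarray(image).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    compressed = np.asarray(Image.open(buffer).convert("RGB")).astype(np.int16)

    # Lossy compression discards mostly high-frequency detail, so whatever
    # it removes is kept here as the (hypothetical) fingerprint.
    return image.astype(np.int16) - compressed
```

Under these assumptions, a fingerprint computed this way would stand in for the true perturbation as the input to the attack classifier when the benign image is unavailable.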
