Deep learning models for image classification have become standard tools in recent years. A well-known vulnerability of these models is their susceptibility to adversarial examples: inputs generated by slightly altering an image of a given class in a way that is imperceptible to humans but causes the model to misclassify it as another class. Many algorithms have been proposed to address this problem, falling generally into one of two categories: (i) building robust classifiers, or (ii) directly detecting attacked images. Despite the good performance of such detectors, we argue that in a white-box setting, where the attacker knows the configuration and weights of both the network and the detector, the attacker can overcome the detector by running many candidate examples against a local copy and sending only those that evade detection to the actual model. This problem is common in security applications, where even a very good model is not sufficient to ensure safety. In this paper we propose to overcome this inherent limitation of any static defence through randomization. To do so, one must generate a very large family of detectors with consistent performance and select one or more of them at random for each input. For the individual detectors, we suggest the method of neural fingerprints. In the training phase, for each class we repeatedly sample a tiny random subset of neurons from certain layers of the network; if the average activation of this subset differs sufficiently between clean and attacked images of the focal class, the subset is considered a fingerprint and added to the detector bank. At test time, we sample fingerprints from the bank associated with the label predicted by the model and detect attacks using a likelihood ratio test. We evaluate our detectors on ImageNet with different attack methods and model architectures, and show near-perfect detection with low rates of false detection.
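The two phases described above (mining fingerprints from clean versus attacked activations, then a randomized likelihood ratio test at inference) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the Gaussian likelihood model, the standardized-gap criterion, and all thresholds are assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_fingerprint_bank(clean_acts, attacked_acts, n_trials=1000,
                           subset_size=5, min_gap=1.0):
    """Repeatedly sample tiny random neuron subsets; keep those whose mean
    activation differs sufficiently between clean and attacked images of one
    class. clean_acts, attacked_acts: (n_images, n_neurons) arrays.
    The standardized-gap criterion below is an illustrative assumption."""
    n_neurons = clean_acts.shape[1]
    bank = []
    for _ in range(n_trials):
        idx = rng.choice(n_neurons, size=subset_size, replace=False)
        clean_vals = clean_acts[:, idx].mean(axis=1)    # per-image subset averages
        att_vals = attacked_acts[:, idx].mean(axis=1)
        gap = abs(clean_vals.mean() - att_vals.mean())
        pooled = np.sqrt((clean_vals.var() + att_vals.var()) / 2) + 1e-8
        if gap / pooled > min_gap:                      # subset is a fingerprint
            bank.append({"idx": idx,
                         "mu_clean": clean_vals.mean(),
                         "sd_clean": clean_vals.std() + 1e-8,
                         "mu_att": att_vals.mean(),
                         "sd_att": att_vals.std() + 1e-8})
    return bank

def detect(acts, bank, n_fingerprints=20):
    """Sample fingerprints at random and sum Gaussian log-likelihood ratios
    (attacked vs. clean); flag the input as attacked when the sum is positive.
    acts: (n_neurons,) activation vector for a single test image."""
    chosen = rng.choice(len(bank), size=min(n_fingerprints, len(bank)),
                        replace=False)
    llr = 0.0
    for i in chosen:
        fp = bank[i]
        v = acts[fp["idx"]].mean()
        log_p_att = -0.5 * ((v - fp["mu_att"]) / fp["sd_att"])**2 \
                    - np.log(fp["sd_att"])
        log_p_clean = -0.5 * ((v - fp["mu_clean"]) / fp["sd_clean"])**2 \
                      - np.log(fp["sd_clean"])
        llr += log_p_att - log_p_clean
    return llr > 0.0
```

Because the fingerprints used for each input are drawn at random from a large bank, an attacker with full knowledge of the bank still cannot predict which subset of detectors a given query will face, which is the randomization argument made above.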