Deep learning models for image classification have become standard tools in recent years. A well-known vulnerability of these models is their susceptibility to adversarial examples: inputs generated by slightly altering an image of a given class in a way that is imperceptible to humans but causes the model to misclassify it as another class. Many algorithms have been proposed to address this problem, falling generally into one of two categories: (i) building robust classifiers, or (ii) directly detecting attacked images. Despite the good performance of such detectors, we argue that in a white-box setting, where the attacker knows the configuration and weights of both the network and the detector, the attacker can overcome the detector by running many candidate examples against a local copy and sending only those that evade detection to the actual model. This problem is common in security applications, where even a very good model is not sufficient to ensure safety. In this paper we propose to overcome this inherent limitation of any static defence through randomization. To do so, one must generate a very large family of detectors with consistent performance and select one or more of them at random for each input. For the individual detectors, we suggest the method of neural fingerprints. In the training phase, for each class we repeatedly sample a tiny random subset of neurons from certain layers of the network; if the average activation of this subset differs sufficiently between clean and attacked images of the focal class, the subset is considered a fingerprint and added to the detector bank. At test time, we sample fingerprints from the bank associated with the label predicted by the model and detect attacks using a likelihood ratio test. We evaluate our detectors on ImageNet with different attack methods and model architectures, and show near-perfect detection with low rates of false detection.
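The two phases described above (mining fingerprints from clean versus attacked activations, then a randomized likelihood ratio test at inference) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the Gaussian likelihood model, the standardized-gap criterion, and all thresholds are assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_fingerprint_bank(clean_acts, attacked_acts, n_trials=1000,
                           subset_size=5, min_gap=1.0):
    """Repeatedly sample tiny random neuron subsets; keep those whose mean
    activation differs sufficiently between clean and attacked images of one
    class. clean_acts, attacked_acts: (n_images, n_neurons) arrays.
    The standardized-gap criterion below is an illustrative assumption."""
    n_neurons = clean_acts.shape[1]
    bank = []
    for _ in range(n_trials):
        idx = rng.choice(n_neurons, size=subset_size, replace=False)
        clean_vals = clean_acts[:, idx].mean(axis=1)    # per-image subset averages
        att_vals = attacked_acts[:, idx].mean(axis=1)
        gap = abs(clean_vals.mean() - att_vals.mean())
        pooled = np.sqrt((clean_vals.var() + att_vals.var()) / 2) + 1e-8
        if gap / pooled > min_gap:                      # subset is a fingerprint
            bank.append({"idx": idx,
                         "mu_clean": clean_vals.mean(),
                         "sd_clean": clean_vals.std() + 1e-8,
                         "mu_att": att_vals.mean(),
                         "sd_att": att_vals.std() + 1e-8})
    return bank

def detect(acts, bank, n_fingerprints=20):
    """Sample fingerprints at random and sum Gaussian log-likelihood ratios
    (attacked vs. clean); flag the input as attacked when the sum is positive.
    acts: (n_neurons,) activation vector for a single test image."""
    chosen = rng.choice(len(bank), size=min(n_fingerprints, len(bank)),
                        replace=False)
    llr = 0.0
    for i in chosen:
        fp = bank[i]
        v = acts[fp["idx"]].mean()
        log_p_att = -0.5 * ((v - fp["mu_att"]) / fp["sd_att"])**2 \
                    - np.log(fp["sd_att"])
        log_p_clean = -0.5 * ((v - fp["mu_clean"]) / fp["sd_clean"])**2 \
                      - np.log(fp["sd_clean"])
        llr += log_p_att - log_p_clean
    return llr > 0.0
```

Because the fingerprints used for each input are drawn at random from a large bank, an attacker with full knowledge of the bank still cannot predict which subset of detectors a given query will face, which is the randomization argument made above.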