Deep neural networks (DNNs) can easily be fooled by imperceptible but purposeful noise added to images, causing them to misclassify those images. Previous defensive work has mostly focused on retraining the models or detecting the noise, but has either shown limited success rates or been defeated by new adversarial examples. Instead of focusing on the adversarial images or the interior of DNN models, we observe that adversarial examples generated by different algorithms can be identified from the output of DNNs (logits). Logits can thus serve as an external feature for training detectors. Based on this observation, we propose HOLMES (Hierarchically Organized Light-weight Multiple dEtector System), which reinforces DNNs by detecting potential adversarial examples in order to minimize the threats they may pose in practice. HOLMES is able to distinguish \textit{unseen} adversarial examples from multiple attacks with higher accuracy and lower false-positive rates than single-detector systems, even under an adaptive attack model. To ensure the diversity and randomness of the detectors in HOLMES, we use two methods: training dedicated detectors for each label and training detectors on top-k logits. Our effective and inexpensive strategies neither modify the original DNN models nor require their internal parameters. HOLMES is not only compatible with all kinds of learning models (even those accessible only through external APIs), but also complementary to other defenses, with which it achieves higher detection rates and may even fully protect the system against a range of adversarial examples.
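The two diversification methods named above (a dedicated detector per label, and detectors trained on top-k logits) can be illustrated with a minimal sketch. It is not the HOLMES implementation: the abstract does not specify the detector architecture, so a nearest-centroid classifier is swapped in as a stand-in, and the hierarchical organization and randomization are left out. All names here (`top_k_logits`, `CentroidDetector`, `train_per_label_detectors`, `detect`) are hypothetical.

```python
def top_k_logits(logits, k):
    """Return the k largest logits in descending order (the detector's input)."""
    return sorted(logits, reverse=True)[:k]


def predicted_label(logits):
    """Index of the largest logit, i.e. the label the DNN would output."""
    return max(range(len(logits)), key=lambda i: logits[i])


def _centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def _squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidDetector:
    """Stand-in detector (an assumption, not the paper's model): labels an input
    by whichever class centroid its top-k logits are closer to."""

    def __init__(self, k):
        self.k = k
        self.benign_centroid = None
        self.adversarial_centroid = None

    def fit(self, benign, adversarial):
        self.benign_centroid = _centroid([top_k_logits(z, self.k) for z in benign])
        self.adversarial_centroid = _centroid(
            [top_k_logits(z, self.k) for z in adversarial]
        )
        return self

    def is_adversarial(self, logits):
        features = top_k_logits(logits, self.k)
        return _squared_distance(features, self.adversarial_centroid) < (
            _squared_distance(features, self.benign_centroid)
        )


def train_per_label_detectors(samples, k):
    """Train one detector per predicted label.

    `samples` is a list of (logits, is_adversarial) pairs. Labels that lack
    either benign or adversarial training samples get no detector.
    """
    by_label = {}
    for logits, is_adv in samples:
        benign, adversarial = by_label.setdefault(predicted_label(logits), ([], []))
        (adversarial if is_adv else benign).append(logits)
    return {
        label: CentroidDetector(k).fit(benign, adversarial)
        for label, (benign, adversarial) in by_label.items()
        if benign and adversarial
    }


def detect(detectors, logits):
    """Route the input to the detector for its predicted label.

    Inputs whose label has no detector are passed through as benign here;
    that fallback is a choice made for this sketch.
    """
    detector = detectors.get(predicted_label(logits))
    return detector is not None and detector.is_adversarial(logits)
```

Because the sketch consumes only logit vectors, it needs nothing from the protected model beyond its outputs, which is the property that lets such a detector sit behind an external API. A real system would replace `CentroidDetector` with a trained classifier and combine several detectors.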