Deep neural network-based classifiers are prone to errors when processing adversarial examples (AEs). AEs are minimally perturbed inputs, imperceptible to humans, that pose significant risks to security-critical applications; hence, extensive research has been devoted to defense mechanisms that mitigate their threat. Most existing methods discriminate AEs based on input-sample features, emphasizing AE detection without recovering the correct category of the sample before the attack. While some tasks may only require rejecting detected AEs, others demand identifying the correct original input category, such as traffic sign recognition in autonomous driving. The objective of this study is to propose a method that rectifies AEs to estimate the correct labels of their original inputs. Our method re-attacks AEs to push them back across the decision boundary for accurate label prediction, effectively rectifying minimally perceptible AEs created by white-box attack methods. Challenges remain, however, in rectifying AEs that black-box attacks place far from the boundary, or that targeted attacks misclassify into low-confidence categories. By adopting a straightforward approach that takes only AEs as inputs, the proposed method handles diverse attacks without requiring parameter tuning or preliminary training. Results demonstrate that the proposed method exhibits consistent performance in rectifying AEs generated by various attack methods, including targeted and black-box attacks. Moreover, it outperforms conventional rectification and input transformation methods in terms of stability against various attacks.
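To make the re-attacking idea concrete, the sketch below illustrates it under stated assumptions: a PyTorch classifier, a single untargeted FGSM step standing in for the re-attack, and illustrative names (`model`, `rectify_label`, `epsilon`) that are not taken from the paper. It is a minimal sketch of the principle, not the authors' exact procedure.

```python
# Minimal sketch: estimate an AE's original label by re-attacking it,
# i.e., pushing it back across the decision boundary with an untargeted
# gradient step and reading off the class it lands in.
import torch
import torch.nn.functional as F

def rectify_label(model: torch.nn.Module, x_adv: torch.Tensor,
                  epsilon: float = 0.03) -> int:
    """Return an estimate of the original label of adversarial input x_adv."""
    model.eval()
    x = x_adv.clone().detach().requires_grad_(True)

    # Current (mis)classification of the adversarial example.
    logits = model(x)
    adv_label = logits.argmax(dim=1)

    # Untargeted re-attack: increase the loss w.r.t. the current
    # adversarial label so the sample moves off that class.
    loss = F.cross_entropy(logits, adv_label)
    loss.backward()
    x_reattacked = x + epsilon * x.grad.sign()

    # The class reached on the other side of the boundary serves as
    # the estimate of the original input's label.
    with torch.no_grad():
        return model(x_reattacked).argmax(dim=1).item()
```

A single fixed-size step suffices when the AE sits just past the boundary, as with minimally perturbed white-box AEs; AEs placed far from the boundary would need iterated or larger re-attack steps, which is the difficulty the abstract notes for black-box and targeted attacks.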