Single-Class Target-Specific Attack against Interpretable Deep Learning Systems

In this paper, we present SingleADV, a novel single-class target-specific adversarial attack. SingleADV generates a universal perturbation that causes the target model to misclassify objects of one specific category as a chosen target category, while keeping the accompanying interpretations highly relevant and accurate. The perturbation is optimized stochastically and iteratively by minimizing an adversarial loss that accounts for both classifier and interpreter costs on the targeted and non-targeted categories. Within this optimization framework, which is governed by first- and second-moment estimates, the resulting loss surface promotes high confidence and high interpretation scores for adversarial samples. By avoiding unintended misclassification of samples from other categories, SingleADV enables more effective targeted attacks on interpretable deep learning systems in both white-box and black-box scenarios. We evaluate SingleADV on four model architectures (ResNet-50, VGG-16, DenseNet-169, and Inception-V3) coupled with three interpretation models (CAM, Grad, and MASK). Extensive empirical evaluation shows that SingleADV deceives the target deep learning models and their associated interpreters under various conditions and settings, achieving an average fooling ratio of 0.74 and an adversarial confidence level of 0.78. Furthermore, we discuss several countermeasures against SingleADV, including a transfer-based learning approach and existing preprocessing defenses.
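To make the optimization concrete, the following is a minimal sketch of a universal perturbation updated with first- and second-moment estimates (an Adam-style rule). It is not the paper's implementation. It rests on several assumptions that are not in the abstract: a hypothetical 3-class linear classifier on 2-D inputs stands in for the deep model, the interpreter cost is omitted entirely, the full batch is used at every step instead of stochastic sampling, the perturbation is clipped to an L-infinity bound `eps`, and the fooling ratio is taken to be the fraction of source-class samples predicted as the target class. All names (`W`, `optimize`, `fooling_ratio`, and so on) are illustrative.

```python
import math

# Hypothetical toy classifier: logits[c] = W[c] . x (assumed weights, not from the paper).
W = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
SOURCE, TARGET, OTHER = 0, 1, 2

# Assumed toy data: source-class samples to be attacked, other-class samples to preserve.
SOURCE_X = [(0.9, 0.0), (1.1, 0.0)]
OTHER_X = [(0.0, 2.9), (0.0, 3.1)]


def softmax_probs(x):
    """Class probabilities of the toy linear classifier."""
    logits = [w[0] * x[0] + w[1] * x[1] for w in W]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_samples():
    """Pair each input with the label the attacker wants the model to output."""
    return [(x, TARGET) for x in SOURCE_X] + [(x, OTHER) for x in OTHER_X]


def loss_and_grad(delta, samples):
    """Mean cross-entropy and its gradient w.r.t. the shared perturbation delta."""
    loss, grad = 0.0, [0.0, 0.0]
    for x, want in samples:
        p = softmax_probs((x[0] + delta[0], x[1] + delta[1]))
        loss -= math.log(p[want])
        # For a linear model, d(loss)/d(delta) = sum_c (p_c - y_c) * W[c].
        for c, w in enumerate(W):
            coeff = p[c] - (1.0 if c == want else 0.0)
            grad[0] += coeff * w[0]
            grad[1] += coeff * w[1]
    n = len(samples)
    return loss / n, [g / n for g in grad]


def optimize(samples, eps=2.0, lr=0.05, steps=1000, b1=0.9, b2=0.999):
    """Adam-style update of one perturbation shared by all samples."""
    delta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        _, g = loss_and_grad(delta, samples)
        for i in range(2):
            m[i] = b1 * m[i] + (1 - b1) * g[i]        # first-moment estimate
            v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2   # second-moment estimate
            m_hat = m[i] / (1 - b1 ** t)              # bias correction
            v_hat = v[i] / (1 - b2 ** t)
            delta[i] -= lr * m_hat / (math.sqrt(v_hat) + 1e-8)
            delta[i] = max(-eps, min(eps, delta[i]))  # keep |delta_i| <= eps
    return delta


def predict(x, delta):
    """Predicted class of the perturbed input."""
    p = softmax_probs((x[0] + delta[0], x[1] + delta[1]))
    return max(range(len(p)), key=p.__getitem__)


def fooling_ratio(delta, source_inputs):
    """Assumed definition: share of source-class inputs predicted as TARGET."""
    hits = sum(1 for x in source_inputs if predict(x, delta) == TARGET)
    return hits / len(source_inputs)


if __name__ == "__main__":
    samples = build_samples()
    delta = optimize(samples)
    print("perturbation:", delta)
    print("fooling ratio:", fooling_ratio(delta, SOURCE_X))
    print("other-class predictions:", [predict(x, delta) for x in OTHER_X])
```

The loss pairs each source-class sample with the target label and each other-class sample with its original label, which mirrors the abstract's distinction between targeted and non-targeted categories. A fuller sketch would add an interpreter term to the same loss and evaluate it on a deep network.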
