Deep learning models are susceptible to adversarial samples in both white-box and black-box environments. Although previous studies have shown high attack success rates, coupling DNN models with interpretation models could offer a sense of security when a human expert is involved, since the expert can judge whether a given sample is benign or malicious. However, interpretable deep learning systems (IDLSes) have been shown to be vulnerable to malicious manipulations in white-box environments. In black-box settings, where access to the components of IDLSes is limited, fooling the system is more challenging for the adversary. In this work, we propose QuScore, a Query-efficient Score-based black-box attack against IDLSes that requires no knowledge of the target model or its coupled interpretation model. QuScore combines transfer-based and score-based methods and employs an effective microbial genetic algorithm. Our method is designed to reduce the number of queries needed to carry out a successful attack. By iteratively refining the adversarial samples based on feedback scores from the IDLS, our approach navigates the search space to identify perturbations that can fool the system. We evaluate the attack's effectiveness on four CNN models (Inception, ResNet, VGG, DenseNet) and two interpretation models (CAM, Grad), using both the ImageNet and CIFAR datasets. Our results show that the proposed approach is query-efficient, with an attack success rate reaching between 95% and 100% and an average transfer success rate of 69% on the ImageNet and CIFAR datasets. The attack generates adversarial examples whose attribution maps resemble those of benign samples. We also demonstrate that the attack is resilient against various preprocessing defense techniques and can easily be transferred to different DNN models.
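To make the score-based search concrete, the following is a minimal sketch of a generic microbial genetic algorithm driven only by feedback scores from a black-box oracle. It is not the QuScore implementation: the function names (`microbial_ga_attack`, `toy_score`), the parameters, and the toy linear oracle are all hypothetical, and the sketch omits the transfer-based component and the interpretation model entirely. It only illustrates the tournament step (pick two candidates, keep the winner, overwrite part of the loser with the winner's genes, mutate, re-query) and how queries are counted.

```python
import random


def microbial_ga_attack(score_fn, x, eps=0.1, pop_size=6, max_queries=500,
                        crossover_rate=0.5, mutation_rate=0.1, seed=0):
    """Search for a perturbation d (each |d_i| <= eps) with score_fn(x + d) > 0.

    score_fn is a hypothetical black-box oracle: it returns one scalar per
    query, and a positive value is taken to mean the attack succeeded.
    Returns (adversarial_sample_or_None, number_of_queries_used).
    """
    rng = random.Random(seed)
    n = len(x)

    def apply(d):
        # Add the perturbation and clip to the valid input range [0, 1].
        return [min(1.0, max(0.0, xi + di)) for xi, di in zip(x, d)]

    # Initial population: random sign perturbations at the budget boundary.
    pop = [[rng.choice((-eps, eps)) for _ in range(n)] for _ in range(pop_size)]
    fit = [score_fn(apply(d)) for d in pop]
    queries = pop_size

    while queries < max_queries:
        best = max(range(pop_size), key=fit.__getitem__)
        if fit[best] > 0:
            return apply(pop[best]), queries

        # Microbial tournament: two random individuals, the winner is kept.
        a, b = rng.sample(range(pop_size), 2)
        winner, loser = (a, b) if fit[a] >= fit[b] else (b, a)

        # The loser receives some of the winner's genes, then is mutated.
        for i in range(n):
            if rng.random() < crossover_rate:
                pop[loser][i] = pop[winner][i]
            if rng.random() < mutation_rate:
                pop[loser][i] = rng.choice((-eps, eps))

        # Only the modified loser is re-scored: one query per generation.
        fit[loser] = score_fn(apply(pop[loser]))
        queries += 1

    best = max(range(pop_size), key=fit.__getitem__)
    return (apply(pop[best]) if fit[best] > 0 else None), queries


def toy_score(x_adv):
    """Toy stand-in for model feedback: a linear margin, positive = fooled."""
    weights = [1.0 if i % 2 == 0 else -1.0 for i in range(len(x_adv))]
    return sum(w * (v - 0.5) for w, v in zip(weights, x_adv)) - 0.4


if __name__ == "__main__":
    x0 = [0.5] * 16
    adv, used = microbial_ga_attack(toy_score, x0)
    print("success:", adv is not None, "queries:", used)
```

Because only the loser of each tournament is changed and re-scored, each generation costs a single query, which is one reason this family of algorithms suits query-limited settings. In a real attack the oracle would be the target system's returned scores rather than `toy_score`.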