Deep neural networks (DNNs) have significantly boosted the performance of many challenging tasks. Despite this remarkable progress, DNNs have also been shown to be vulnerable: recent studies demonstrate that adversaries can manipulate their predictions by adding a universal adversarial perturbation (UAP) to benign samples. Meanwhile, increasing efforts have been made to help users understand and explain the inner workings of DNNs by highlighting the most informative parts of samples with respect to their predictions (i.e., attribution maps). In this work, we first empirically find that the attribution maps of benign and adversarial examples differ significantly, which makes them a potential signal for detecting universal adversarial perturbations and thus defending against adversarial attacks. This finding motivates us to investigate a new research problem: do there exist universal adversarial perturbations that can jointly attack a DNN classifier and its interpretation? Giving an explicit answer is challenging, since the two objectives are seemingly conflicting. In this paper, we propose a novel attacking framework to generate joint universal adversarial perturbations (JUAP), which fool the DNN model and misguide inspection by interpreters simultaneously. Comprehensive experiments on various datasets demonstrate the effectiveness of the proposed JUAP for joint attacks. To the best of our knowledge, this is the first effort to study UAPs that jointly attack both DNNs and their interpretations.
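The two phenomena the abstract builds on can be illustrated with a minimal toy sketch (not the paper's method): a single shared perturbation, added to every input, flips the predictions of a linear stand-in "classifier", and a gradient-times-input attribution map (one common attribution method) changes noticeably for the flipped samples. All names and the model here are illustrative assumptions, not from the paper.

```python
import math
import random

random.seed(0)
DIM = 8

# Toy two-class linear "classifier" as a hypothetical stand-in for a DNN.
W = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(2)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(x):
    # Predicted class = argmax of the two logits.
    return 0 if dot(W[0], x) >= dot(W[1], x) else 1

def attribution(x):
    # Gradient-times-input attribution map for the predicted class;
    # for a linear model the gradient is simply the weight row.
    c = predict(x)
    return [w * xi for w, xi in zip(W[c], x)]

# A single "universal" perturbation: one fixed step along the direction
# that raises the class-1 logit relative to class 0, applied to every input.
direction = [a - b for a, b in zip(W[1], W[0])]
norm = math.sqrt(dot(direction, direction))
uap = [2.0 * d / norm for d in direction]

samples = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]
flipped = 0
discrepancies = []
for x in samples:
    x_adv = [a + b for a, b in zip(x, uap)]
    if predict(x_adv) != predict(x):
        flipped += 1
    # Attribution discrepancy = 1 - cosine similarity of the two maps;
    # large values hint that the input may have been perturbed.
    a, b = attribution(x), attribution(x_adv)
    cos = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-9)
    discrepancies.append(1.0 - cos)

print(f"{flipped}/50 predictions flipped by one shared perturbation")
```

In this sketch the same perturbation vector fools many inputs at once, and the attribution maps of the flipped samples diverge from their benign counterparts, which is exactly the discrepancy the abstract proposes to exploit for detection and that JUAP aims to suppress.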