We discuss a vulnerability affecting a category of attribution methods used to explain the outputs of convolutional neural networks acting as classifiers. It is well known that this type of network is vulnerable to adversarial attacks, in which imperceptible perturbations of the input may alter the outputs of the model. In contrast, here we focus on the effects that small modifications of the model may have on the attribution method while leaving the model outputs unaltered.
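The phenomenon above can be illustrated with a minimal toy sketch (not the paper's actual construction): for a linear model with gradient-based attribution, perturbing the weights in a direction orthogonal to a given input changes the attribution while leaving the model's logits on that input untouched. All names here (`W`, `x`, `v`) are hypothetical illustration variables.

```python
import numpy as np

# Toy linear "classifier" f(x) = W @ x; the gradient-based attribution
# for class c is simply the weight row W[c].
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # 3 classes, 5 input features
x = rng.normal(size=5)

logits = W @ x
c = int(np.argmax(logits))
attr = W[c]                   # gradient of logit c w.r.t. x

# Model modification: add to row c a vector v orthogonal to x.
# Since v @ x = 0, the logits on x are unchanged, yet the
# attribution (the gradient) is different.
v = rng.normal(size=5)
v -= (v @ x) / (x @ x) * x    # project out the x-direction
W_mod = W.copy()
W_mod[c] += v

logits_mod = W_mod @ x
attr_mod = W_mod[c]

assert np.allclose(logits, logits_mod)   # outputs identical on x
assert not np.allclose(attr, attr_mod)   # attribution has changed
```

This only guarantees unchanged outputs on the single input `x`; the attacks discussed in the text are stronger, preserving the model's outputs more broadly, but the toy case captures the core idea that attributions can move independently of predictions.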