We discuss a vulnerability affecting a category of attribution methods used to explain the outputs of convolutional neural networks working as classifiers. Networks of this type are known to be vulnerable to adversarial attacks, in which imperceptible perturbations of the input may alter the outputs of the model. In contrast, here we focus on the effects that small modifications to the model may have on the attribution method without altering the model outputs.