Foiling Explanations in Deep Neural Networks

Deep neural networks (DNNs) have greatly impacted numerous fields over the past decade. Yet despite their superb performance on many problems, their black-box nature still poses a significant challenge with respect to explainability. Indeed, explainable artificial intelligence (XAI) is crucial in several fields, wherein the answer alone -- without an account of how it was derived -- is of little value. This paper uncovers a troubling property of explanation methods for image-based DNNs: using evolution strategies, we demonstrate that small visual changes to the input image, which hardly influence the network's output, suffice to manipulate explanations arbitrarily. Our novel algorithm, AttaXAI, is a model-agnostic adversarial attack on XAI algorithms that requires access only to the output logits of a classifier and to the explanation map; these weak assumptions make our approach highly useful where real-world models and data are concerned. We evaluate our method on two benchmark datasets -- CIFAR100 and ImageNet -- using four different pretrained deep-learning models: VGG16-CIFAR100, VGG16-ImageNet, MobileNet-CIFAR100, and Inception-v3-ImageNet. We find that the XAI methods can be manipulated without the use of gradients or other model internals. AttaXAI successfully manipulates an image in a manner imperceptible to the human eye, such that the XAI method outputs a specific explanation map. To our knowledge, this is the first such method in a black-box setting, and we believe it has significant value where explainability is desired, required, or legally mandatory.
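
The black-box idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' AttaXAI implementation: the loss weighting `alpha`, the hyperparameters, the NES-style antithetic gradient estimator, and the names `attack_loss`, `es_explanation_attack`, `toy_logits`, and `toy_explain` are all assumptions made for illustration. A small linear "classifier" and a gradient-times-input map stand in for a real pretrained model and XAI method. The search only queries the logits and the explanation map, never the model's gradients.

```python
import numpy as np


def attack_loss(z, target_map, base_logits, logits_fn, explain_fn, alpha=1.0):
    """Distance of the explanation to the target plus a penalty on logit drift."""
    map_term = np.mean((explain_fn(z) - target_map) ** 2)
    logit_term = np.mean((logits_fn(z) - base_logits) ** 2)
    return map_term + alpha * logit_term


def es_explanation_attack(x, target_map, logits_fn, explain_fn, steps=100,
                          pop=20, sigma=0.05, lr=0.01, eps=0.05, alpha=1.0, seed=0):
    """Query-only search for an input near x whose explanation moves toward target_map."""
    rng = np.random.default_rng(seed)
    base_logits = logits_fn(x)

    def loss(z):
        return attack_loss(z, target_map, base_logits, logits_fn, explain_fn, alpha)

    x_adv, best, best_loss = x.copy(), x.copy(), loss(x)
    for _ in range(steps):
        noise = rng.standard_normal((pop,) + x.shape)
        # Antithetic sampling: score each random direction at +sigma and -sigma.
        diffs = np.array([loss(x_adv + sigma * n) - loss(x_adv - sigma * n)
                          for n in noise])
        grad_est = np.tensordot(diffs, noise, axes=1) / (2.0 * sigma * pop)
        x_adv = x_adv - lr * grad_est
        # Keep the change small (L-infinity radius eps) and pixel values in [0, 1].
        x_adv = np.clip(np.clip(x_adv, x - eps, x + eps), 0.0, 1.0)
        current = loss(x_adv)
        if current < best_loss:  # remember the best candidate seen so far
            best, best_loss = x_adv.copy(), current
    return best


# Toy stand-ins (hypothetical): a 3-class linear model on a 16-pixel "image".
_rng = np.random.default_rng(1)
W = _rng.standard_normal((3, 16))


def toy_logits(z):
    return W @ z


def toy_explain(z):
    # Gradient-times-input attribution for class 0 of the linear model.
    return W[0] * z


x = _rng.uniform(0.2, 0.8, size=16)
target = toy_explain(_rng.uniform(0.2, 0.8, size=16))  # explanation of another input
x_adv = es_explanation_attack(x, target, toy_logits, toy_explain)
```

In this sketch the perturbation is bounded by `eps` per pixel, and the logit penalty keeps the model output close to its original value while the explanation map drifts toward the target. With a real model, `logits_fn` and `explain_fn` would wrap the classifier and the chosen explanation method.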
