Deep neural networks (DNNs) have greatly impacted numerous fields over the past decade. Yet despite superb performance on many problems, their black-box nature still poses a significant challenge for explainability. Indeed, explainable artificial intelligence (XAI) is crucial in several fields, where an answer alone, without the reasoning behind it, is of little value. This paper uncovers a troubling property of explanation methods for image-based DNNs: by making small visual changes to the input image that hardly influence the network's output, we demonstrate how explanations may be arbitrarily manipulated through the use of evolution strategies. Our novel algorithm, AttaXAI, is a model-agnostic adversarial attack on XAI algorithms that requires access only to the output logits of a classifier and to the explanation map; these weak assumptions make our approach highly useful where real-world models and data are concerned. We evaluate our method on two benchmark datasets, CIFAR100 and ImageNet, using four different pretrained deep-learning models: VGG16-CIFAR100, VGG16-ImageNet, MobileNet-CIFAR100, and Inception-v3-ImageNet. We find that the XAI methods can be manipulated without the use of gradients or other model internals. AttaXAI successfully manipulates an image in a manner imperceptible to the human eye, such that the XAI method outputs a specific explanation map. To our knowledge, this is the first such method in a black-box setting, and we believe it has significant value where explainability is desired, required, or legally mandatory.
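To make the black-box setting concrete, the following is a minimal sketch of a query-only attack loop driven by an evolution-strategy gradient estimate. It is not the AttaXAI implementation: the toy linear classifier, the input-times-gradient explanation, the antithetic-sampling estimator, the loss weighting `alpha`, and all function names and hyperparameters are assumptions chosen for illustration. The only property it shares with the setting described above is that the attacker calls `logits_fn` and `explain_fn` and never reads model internals.

```python
import numpy as np


def make_toy_black_box(dim=16, classes=3, seed=0):
    """Hypothetical stand-in for a classifier plus an XAI method."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(classes, dim))  # weights stay hidden from the attacker

    def logits_fn(x):
        return W @ x

    def explain_fn(x):
        # Input-times-gradient map for the predicted class of a linear model.
        return W[int(np.argmax(W @ x))] * x

    return logits_fn, explain_fn


def attack_loss(x_adv, x, target_map, logits_fn, explain_fn, alpha=1.0):
    # Pull the explanation toward the target while keeping the logits close
    # to those of the original input (alpha is an assumed trade-off weight).
    expl_term = np.mean((explain_fn(x_adv) - target_map) ** 2)
    out_term = np.mean((logits_fn(x_adv) - logits_fn(x)) ** 2)
    return expl_term + alpha * out_term


def es_attack(x, target_map, logits_fn, explain_fn, eps=0.1,
              pop=50, sigma=0.05, lr=0.02, steps=200, seed=0):
    """Search for a perturbation with |delta| <= eps using only queries."""
    rng = np.random.default_rng(seed)

    def loss(z):
        return attack_loss(z, x, target_map, logits_fn, explain_fn)

    label = int(np.argmax(logits_fn(x)))
    delta = np.zeros_like(x)
    best_x, best_loss = x.copy(), loss(x)
    for _ in range(steps):
        # Antithetic sampling: estimate the loss gradient from paired queries.
        grad = np.zeros_like(x)
        for n in rng.normal(size=(pop, x.size)):
            diff = loss(x + delta + sigma * n) - loss(x + delta - sigma * n)
            grad += diff * n
        grad /= 2.0 * sigma * pop
        delta = np.clip(delta - lr * grad, -eps, eps)
        # Assumes pixel values live in [0, 1].
        cand = np.clip(x + delta, 0.0, 1.0)
        cand_loss = loss(cand)
        # Keep a candidate only if it improves the loss and the predicted
        # class is unchanged.
        if cand_loss < best_loss and int(np.argmax(logits_fn(cand))) == label:
            best_x, best_loss = cand, cand_loss
    return best_x
```

A caller would build the two query functions with `make_toy_black_box()`, pick an input `x` with values in [0, 1] and a target explanation map of the same shape, and pass them to `es_attack`. Because the loop keeps only candidates that preserve the predicted class and lower the loss, the returned input is never worse than the original under `attack_loss`, and its per-pixel distance from `x` never exceeds `eps`.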