Predictions made by graph neural networks (GNNs) usually lack interpretability due to their complex computational behavior and the abstract nature of graphs. To address this, many GNN explanation methods have emerged. Their goal is to explain a model's predictions and thereby build trust when GNN models are deployed in decision-critical applications. Most GNN explanation methods work in a post-hoc manner and provide explanations in the form of a small subset of important edges and/or nodes. In this paper, we demonstrate that these explanations unfortunately cannot be trusted, as common GNN explanation methods turn out to be highly susceptible to adversarial perturbations. That is, even small perturbations of the original graph structure that preserve the model's predictions may yield drastically different explanations. This calls into question the trustworthiness and practical utility of post-hoc explanation methods for GNNs. To attack GNN explanation models, we devise a novel attack method dubbed \textit{GXAttack}, the first \textit{optimization-based} adversarial attack method for post-hoc GNN explanations in this setting. Given the devastating effectiveness of our attack, we call for an adversarial evaluation of future GNN explainers to demonstrate their robustness.