Counterfactual Explanations (CEs) have received increasing interest as a major methodology for explaining neural network classifiers. Usually, CEs for an input-output pair are defined as data points with minimum distance to the input that are classified with a different label than the output. To tackle the well-known problem that CEs are easily invalidated when model parameters are updated (e.g., when the model is retrained), studies have proposed ways to certify the robustness of CEs under model parameter changes bounded by a norm ball. However, existing methods targeting this form of robustness are not sound or complete, and they may generate implausible CEs, i.e., outliers with respect to the training dataset. In fact, no existing method simultaneously optimises for proximity and plausibility while preserving robustness guarantees. In this work, we propose Provably RObust and PLAusible Counterfactual Explanations (PROPLACE), a method that leverages robust optimisation techniques to address these limitations. We formulate an iterative algorithm to compute provably robust CEs and prove its convergence, soundness and completeness. Through a comparative experiment involving six baselines, five of which target robustness, we show that PROPLACE achieves state-of-the-art performance on metrics covering three evaluation aspects.
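To make the notion of robustness under norm-bounded parameter changes concrete, the following is a minimal sketch and not the PROPLACE algorithm itself. It assumes a linear binary classifier in place of a neural network, an infinity-norm ball of radius `delta` around the parameters, and a small hand-picked pool of candidate points standing in for plausible (training-like) data. The names `worst_case_logit` and `nearest_robust_ce` are hypothetical and chosen only for illustration.

```python
import numpy as np

def worst_case_logit(w, b, x, delta):
    """Lowest value of w'.x + b' over all (w', b') whose entries differ
    from (w, b) by at most delta (an infinity-norm ball).
    For a linear model this minimum has a closed form."""
    return float(w @ x + b - delta * (np.abs(x).sum() + 1.0))

def nearest_robust_ce(w, b, x, candidates, delta):
    """Among the candidates that remain in class 1 (positive logit) under
    every perturbation in the ball, return the one closest to x in
    L1 distance, or None if no candidate is robust."""
    robust = [c for c in candidates if worst_case_logit(w, b, c, delta) > 0]
    if not robust:
        return None
    return min(robust, key=lambda c: np.abs(c - x).sum())

# Illustrative (assumed) model and data.
w = np.array([1.0, 1.0])
b = -1.0
x = np.array([0.2, 0.2])  # logit is -0.6, so x is classified as class 0
candidates = [np.array([0.6, 0.5]), np.array([1.0, 1.0]), np.array([2.0, 2.0])]
delta = 0.1

ce = nearest_robust_ce(w, b, x, candidates, delta)
print(ce)  # [1. 1.]
```

In this toy setting the closest candidate, `[0.6, 0.5]`, is a valid CE for the nominal model (logit 0.1) but is invalidated by some perturbation inside the ball (worst-case logit -0.11), so the sketch returns the next closest candidate, `[1.0, 1.0]`, whose class is preserved for every admissible parameter change. For neural networks no such closed form is available in general, which is where certified robust optimisation methods come in.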