We study the problem of assessing the robustness of counterfactual explanations for deep learning models. We focus on \emph{plausible model shifts} altering model parameters and propose a novel framework to reason about the robustness property in this setting. To motivate our solution, we begin by showing for the first time that computing the robustness of counterfactuals with respect to plausible model shifts is NP-complete. As this (practically) rules out the existence of scalable algorithms for exactly computing robustness, we propose a novel probabilistic approach that provides tight robustness estimates with strong guarantees while preserving scalability. Remarkably, and unlike existing solutions targeting plausible model shifts, our approach imposes no requirements on the network to be analyzed, thus enabling robustness analysis on a wider range of architectures. Experiments on four binary classification datasets indicate that our method improves the state of the art in generating robust explanations, outperforming existing approaches on a range of metrics.
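To make the probabilistic approach concrete, below is a minimal sketch, not the paper's actual algorithm, of how robustness to plausible model shifts could be estimated by Monte Carlo sampling with a Hoeffding-style guarantee. All names (`mlp_forward`, `sample_shifted_model`, `estimate_robustness`) and the bounded multiplicative perturbation model of "plausible shifts" are illustrative assumptions.

```python
# Hedged sketch: Monte Carlo estimation of counterfactual robustness under
# plausible model shifts, with a Hoeffding-style confidence bound.
# The perturbation model and all function names are assumptions for
# illustration, not the paper's method.
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass of a ReLU MLP with a single output logit (binary case)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    W, b = weights[-1], biases[-1]
    return float((W @ h + b)[0])  # logit; class 1 iff logit > 0

def sample_shifted_model(weights, biases, delta, rng):
    """Sample one 'plausible' shift: a bounded multiplicative perturbation
    of every parameter (one possible formalization; the paper's may differ)."""
    sw = [W * (1.0 + rng.uniform(-delta, delta, W.shape)) for W in weights]
    sb = [b * (1.0 + rng.uniform(-delta, delta, b.shape)) for b in biases]
    return sw, sb

def estimate_robustness(xc, target, weights, biases, delta,
                        eps=0.05, conf=0.99, rng=None):
    """Estimate P[shifted model still assigns `target` to counterfactual xc].
    Hoeffding's inequality: n >= ln(2 / (1 - conf)) / (2 * eps**2) samples
    put the estimate within eps of the true probability with prob. >= conf."""
    rng = rng or np.random.default_rng(0)
    n = int(np.ceil(np.log(2.0 / (1.0 - conf)) / (2.0 * eps ** 2)))
    hits = 0
    for _ in range(n):
        sw, sb = sample_shifted_model(weights, biases, delta, rng)
        pred = int(mlp_forward(xc, sw, sb) > 0.0)
        hits += (pred == target)
    return hits / n, n

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy 2-4-1 network and a candidate counterfactual input.
    Ws = [rng.normal(size=(4, 2)), rng.normal(size=(1, 4))]
    bs = [rng.normal(size=4), rng.normal(size=1)]
    xc = np.array([0.5, -1.0])
    rob, n = estimate_robustness(xc, target=1, weights=Ws, biases=bs, delta=0.05)
    print(f"estimated robustness: {rob:.3f} ({n} sampled shifts)")
```

Note that under this sampling scheme the required number of samples depends only on the desired precision and confidence, not on the network's size or architecture, which is consistent with the abstract's claim that the probabilistic approach scales without imposing requirements on the analyzed network.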