There is emerging interest in generating robust counterfactual explanations that remain valid when the model is updated or changed, even slightly. To find robust counterfactuals, existing literature often assumes that the distance between the original model $m$ and the new model $M$ is bounded in the parameter space, i.e., $\|\text{Params}(M){-}\text{Params}(m)\|{<}\Delta$. However, models can often change significantly in the parameter space with little to no change in their predictions or accuracy on the given dataset. In this work, we introduce a mathematical abstraction termed \emph{naturally-occurring} model change, which allows for arbitrary changes in the parameter space as long as the change in predictions on points that lie on the data manifold is limited. We then propose a measure, which we call \emph{Stability}, to quantify the robustness of counterfactuals to potential model changes for differentiable models, e.g., neural networks. Our main contribution is to show that counterfactuals with a sufficiently high value of \emph{Stability}, as defined by our measure, will remain valid after potential ``naturally-occurring'' model changes with high probability (leveraging concentration bounds for Lipschitz functions of independent Gaussians). Since our quantification depends on the local Lipschitz constant around a data point, which is not always available, we also examine practical relaxations of our proposed measure and demonstrate experimentally how they can be incorporated to find robust counterfactuals for neural networks that are close, realistic, and remain valid after potential model changes.
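The observation that a model can move far in the parameter space without changing its predictions can be illustrated with a small sketch. The example below is not taken from the paper: it assumes a hypothetical two-layer ReLU network with random weights and reorders its hidden units, a standard symmetry of such networks. The names `mlp`, `flatten`, `param_distance`, and `max_pred_change` are illustrative only.

```python
import numpy as np


def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU network: x -> relu(x W1 + b1) W2 + b2."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2


def flatten(*arrays):
    """Stack all parameter arrays into a single vector, i.e. Params(.)."""
    return np.concatenate([a.ravel() for a in arrays])


rng = np.random.default_rng(0)
n_features, n_hidden = 5, 16

# Original model m (random weights, purely for illustration).
W1 = rng.normal(size=(n_features, n_hidden))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = rng.normal(size=1)

# New model M: the same network with its hidden units in reverse order.
# Columns of W1, entries of b1, and rows of W2 are permuted consistently.
perm = np.arange(n_hidden)[::-1]
W1_new, b1_new, W2_new = W1[:, perm], b1[perm], W2[perm, :]

X = rng.normal(size=(100, n_features))

# ||Params(M) - Params(m)||: large, because the weights moved around.
param_distance = np.linalg.norm(
    flatten(W1, b1, W2, b2) - flatten(W1_new, b1_new, W2_new, b2)
)

# Largest change in prediction over the sample: zero up to rounding error.
max_pred_change = np.max(
    np.abs(mlp(X, W1, b1, W2, b2) - mlp(X, W1_new, b1_new, W2_new, b2))
)

print(f"parameter distance: {param_distance:.3f}")
print(f"max prediction change: {max_pred_change:.2e}")
```

Under these assumptions, a bound of the form $\|\text{Params}(M){-}\text{Params}(m)\|{<}\Delta$ would need a large $\Delta$ to cover this change, even though the two networks compute the same function. This is only one way such a gap can arise; retraining on similar data may produce changes that are neither exact symmetries nor small in the parameter space.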