Counterfactual explanations play an important role in detecting bias and improving the explainability of data-driven classification models. A counterfactual explanation (CE) is a minimally perturbed data point for which the decision of the model changes. Most existing methods can only provide a single CE, which may not be achievable for the user. In this work, we derive an iterative method to calculate robust CEs, i.e., CEs that remain valid even after the features are slightly perturbed. To this end, our method provides a whole region of CEs, allowing the user to choose a suitable recourse to obtain a desired outcome. We use algorithmic ideas from robust optimization and prove convergence results for the most common machine learning methods, including logistic regression, decision trees, random forests, and neural networks. Our experiments show that our method can efficiently generate globally optimal robust CEs for a variety of common data sets and classification models.
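To make the distinction between a CE and a robust CE concrete, the following is a minimal sketch for the simplest case only: a linear classifier (such as logistic regression) that predicts the desired class when w·x + b >= 0, with Euclidean distance and an l2 ball of radius rho as the perturbation set. These choices, and the names `linear_counterfactual`, `w`, `b`, and `rho`, are assumptions made for illustration. This closed form is not the iterative method of the paper, which also covers trees, forests, and neural networks.

```python
import numpy as np

def linear_counterfactual(x, w, b, rho=0.0):
    """Closest point (in l2 distance) to x with w @ x' + b >= rho * ||w||.

    rho = 0 yields an ordinary CE lying on the decision boundary.
    rho > 0 yields a robust CE: every point within l2 distance rho of the
    result still satisfies w @ z + b >= 0 (up to floating-point error).
    Illustrative assumption: linear model, l2 norm, desired class is score >= 0.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    norm_w = np.linalg.norm(w)
    score = w @ x + b
    target = rho * norm_w          # required margin for robustness
    if score >= target:
        return x.copy()            # x is already a (robust) positive point
    step = (target - score) / norm_w**2
    return x + step * w            # move along w, the shortest path to the margin

# Hypothetical toy model and factual instance.
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([0.0, 0.0])

x_ce = linear_counterfactual(x, w, b)               # ordinary CE
x_robust = linear_counterfactual(x, w, b, rho=0.5)  # robust CE

# Worst-case perturbation of length rho pushes directly against w.
worst = x_robust - 0.5 * w / np.linalg.norm(w)
print(x_ce, x_robust, w @ worst + b)
```

In this toy setting the ordinary CE sits exactly on the decision boundary, so an arbitrarily small change in the features can flip the decision back. The robust CE lies farther from the original point, and in exchange the entire ball of radius `rho` around it is a region of valid CEs from which the user can choose.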