Counterfactual Explanations are becoming a de facto standard in post-hoc interpretable machine learning. For a given classifier and an instance classified in an undesired class, a counterfactual explanation is a small perturbation of that instance that changes the classification outcome. This work leverages Counterfactual Explanations to detect the important decision boundaries of a pre-trained black-box model. This information is used to build a supervised discretization of the dataset's features with tunable granularity. From the discretized dataset, an optimal Decision Tree can be trained that resembles the black-box model while remaining interpretable and compact. Numerical results on real-world datasets show the effectiveness of the approach in terms of accuracy and sparsity.
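The pipeline described above (counterfactuals, then boundary thresholds, then discretization) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `black_box` is a hypothetical stand-in for a pre-trained model, `naive_counterfactual` is a simple single-feature grid search swapped in for a proper counterfactual optimization problem, and the `keep` parameter is one possible way to expose the tunable granularity (keep only the most frequently hit thresholds per feature). It is not the authors' implementation.

```python
from bisect import bisect_right
from collections import Counter


def black_box(x):
    # Hypothetical stand-in for a pre-trained model: only feature 0 matters.
    return 1 if x[0] > 0.5 else 0


def naive_counterfactual(x, predict, step=0.05, max_steps=40):
    """Smallest single-feature move on a fixed grid that flips predict(x).

    Assumption: a brute-force search used here in place of a real
    counterfactual optimization. Returns (feature index, new value) or None.
    """
    original = predict(x)
    for k in range(1, max_steps + 1):  # try increasingly large perturbations
        for j in range(len(x)):
            for sign in (1, -1):
                candidate = list(x)
                candidate[j] = round(x[j] + sign * k * step, 10)
                if predict(candidate) != original:
                    return j, candidate[j]
    return None  # no flip found within the search budget


def boundary_thresholds(points, predict, keep=1):
    """Count where counterfactuals land on each feature.

    `keep` controls granularity: only the `keep` most frequent
    values per feature are retained as discretization thresholds.
    """
    counts = {}
    for x in points:
        found = naive_counterfactual(x, predict)
        if found is not None:
            j, value = found
            counts.setdefault(j, Counter())[value] += 1
    return {j: sorted(v for v, _ in c.most_common(keep))
            for j, c in counts.items()}


def discretize(x, thresholds):
    """Map each feature value to the index of the interval it falls in."""
    return [bisect_right(thresholds.get(j, []), x[j]) for j in range(len(x))]


# Toy dataset: an 11 x 11 grid on the unit square.
points = [[i / 10, j / 10] for i in range(11) for j in range(11)]
thresholds = boundary_thresholds(points, black_box, keep=1)
binned = [discretize(p, thresholds) for p in points]
```

In this toy setting only feature 0 receives a threshold, so feature 1 collapses to a single bin, and the bin index of feature 0 already reproduces the black-box label on every grid point. The final step from the abstract, training an optimal Decision Tree on the discretized data, is omitted here; any tree learner could be fitted on `binned` with the black-box predictions as targets.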