Counterfactual explanations for machine learning models are used to find minimal interventions on the feature values such that the model changes its prediction to a different output or to a target output. A valid counterfactual explanation should have likely feature values. Here, we address the challenge of generating counterfactual explanations that lie in the same data distribution as the training data and, more importantly, belong to the target class distribution. This requirement has previously been addressed by incorporating an auto-encoder reconstruction loss into the counterfactual search process. Connecting the output behavior of the classifier to the latent space of the auto-encoder has further improved the speed of the counterfactual search and the interpretability of the resulting explanations. Continuing this line of research, we show a further improvement in the interpretability of counterfactual explanations when the auto-encoder is trained in a semi-supervised fashion with class-tagged input data. We empirically evaluate our approach on several datasets and show considerable improvement in terms of several metrics.
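
As a rough illustration of the kind of objective this line of work optimizes, and not necessarily the exact formulation evaluated here, a counterfactual search with an auto-encoder reconstruction term can be sketched as follows. All symbols are assumed notation introduced for this sketch: $x$ is the original input, $x'$ the candidate counterfactual, $f$ the classifier, $t$ the target class, $\ell$ a classification loss, $d$ a distance between inputs, $\mathrm{AE}$ the auto-encoder, and $c, \gamma \ge 0$ are weighting hyperparameters.

```latex
\begin{equation}
x_{\mathrm{cf}} \;=\; \arg\min_{x'} \;
    c \, \ell\bigl(f(x'),\, t\bigr)
    \;+\; d(x,\, x')
    \;+\; \gamma \, \bigl\lVert x' - \mathrm{AE}(x') \bigr\rVert_2^2
\end{equation}
```

The first term pushes the prediction toward the target class, the second keeps the intervention small, and the third penalizes candidates that the auto-encoder reconstructs poorly, that is, candidates that fall outside the data distribution the auto-encoder was trained on.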