Counterfactual explanations describe how to modify a feature vector in order to flip the outcome of a trained classifier. Obtaining robust counterfactual explanations is essential to provide valid algorithmic recourse and meaningful explanations. We study the robustness of explanations of randomized ensembles, which are always subject to algorithmic uncertainty even when the training data is fixed. We formalize the generation of robust counterfactual explanations as a probabilistic problem and show the link between the robustness of ensemble models and the robustness of base learners. We develop a practical method with good empirical performance and support it with theoretical guarantees for ensembles of convex base learners. Our results show that existing methods give surprisingly low robustness: the validity of naive counterfactuals is below $50\%$ on most data sets and can fall to $20\%$ on problems with many features. In contrast, our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
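To make the notion of validity under algorithmic uncertainty concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a toy randomized ensemble (bagged one-dimensional threshold stumps) and estimates how often a counterfactual stays valid when the ensemble is retrained on the same data with fresh randomness. All function names, the synthetic data, and the margin of 0.2 are hypothetical choices made for illustration only.

```python
import random


def make_data(n=30, seed=42):
    """Fixed toy training set: x in [0, 1], noisy label 1 when x is large."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    return [(x, int(x + rng.gauss(0, 0.15) > 0.5)) for x in xs]


def train_stump(sample):
    """Fit a threshold t (predict 1 when x > t) with the fewest training errors."""
    xs = sorted(x for x, _ in sample)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: sum((x > t) != bool(y) for x, y in sample))


def train_ensemble(data, n_learners, rng):
    """Bagging: each stump is fit on a bootstrap resample drawn from rng."""
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_learners)]


def predict(ensemble, x):
    """Majority vote of the stumps."""
    votes = sum(x > t for t in ensemble)
    return int(2 * votes > len(ensemble))


def naive_counterfactual(ensemble, eps=1e-6):
    """A point just past this one ensemble's decision boundary (odd ensemble size)."""
    return sorted(ensemble)[len(ensemble) // 2] + eps


def empirical_validity(data, x_cf, target, n_learners=15, n_retrain=100, seed=0):
    """Fraction of retrained ensembles (same data, fresh randomness) that map x_cf to target."""
    rng = random.Random(seed)
    hits = sum(
        predict(train_ensemble(data, n_learners, rng), x_cf) == target
        for _ in range(n_retrain)
    )
    return hits / n_retrain


if __name__ == "__main__":
    data = make_data()
    model = train_ensemble(data, 15, random.Random(1))
    x_naive = naive_counterfactual(model)
    print("naive counterfactual validity:", empirical_validity(data, x_naive, 1))
    print("validity with extra margin:  ", empirical_validity(data, x_naive + 0.2, 1))
```

In this toy setting every stump's prediction is non-decreasing in x and both estimates reuse the same seed, so moving the counterfactual further past the boundary cannot lower the estimated validity. How much it raises it, and at what cost in distance, depends on the data and on the ensemble.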