We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they indicate how a factual input would need to change for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data distribution and actionable with respect to feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
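The idea above can be illustrated with a minimal NumPy sketch. This is not the paper's actual objective: the divergence term between representations and explanations is stood in for by a crude augmentation scheme (generate post-hoc counterfactuals, pull them toward the target-class data mean as a plausibility proxy, and retrain on the augmented set). All function names, the toy data, and the plausibility step are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data: two well-separated Gaussian blobs in 2-D.
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)), rng.normal(1.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, epochs=300, lr=0.5):
    """Plain logistic regression trained by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def counterfactual(x, target, w, b, steps=50, lr=0.5):
    """Post-hoc counterfactual: gradient steps on the *input* toward `target`.

    For a linear model the input gradient of the logit is simply w.
    """
    x = x.copy()
    for _ in range(steps):
        p = sigmoid(x @ w + b)
        x += lr * (target - p)[:, None] * w
    return x

# Phase 1: ordinary training.
w, b = fit_logreg(X, y)

# Phase 2 (counterfactual training, sketched as augmentation): generate
# counterfactuals for the class-0 points, nudge them toward the class-1
# data mean as a stand-in plausibility constraint, then retrain so the
# model's decision surface agrees with these plausible explanations.
cf = counterfactual(X[y == 0], 1.0, w, b)
mu1 = X[y == 1].mean(axis=0)
cf_plausible = 0.5 * cf + 0.5 * mu1  # pull counterfactuals onto the data region
X_aug = np.vstack([X, cf_plausible])
y_aug = np.concatenate([y, np.ones(len(cf_plausible))])
w2, b2 = fit_logreg(X_aug, y_aug)

acc = np.mean((sigmoid(X @ w2 + b2) > 0.5) == y)
print(f"accuracy after counterfactual training: {acc:.2f}")
```

In the paper's formulation the counterfactuals enter through a divergence penalty in the loss rather than as extra training points; the augmentation here is only a compact way to show counterfactuals participating in the training phase itself rather than being generated purely post hoc.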