Counterfactuals, i.e., modified inputs that lead to a different classifier outcome, are an important tool for understanding the logic used by machine learning classifiers and for determining how to change an undesirable classification. Even when a counterfactual changes a classifier's decision, however, it may not affect the true underlying class probabilities; that is, the counterfactual may act like an adversarial attack and ``fool'' the classifier. We propose Trustworthy Actionable Perturbations (TAP), a new framework for creating modified inputs that change the true underlying probabilities in a beneficial way. The framework includes a novel verification procedure to ensure that TAP change the true class probabilities instead of acting adversarially, along with new cost, reward, and goal definitions that are better suited to effectuating change in the real world. We present PAC-learnability results for our verification procedure and theoretically analyze our new method for measuring reward. We also develop a methodology for creating TAP and compare our results to those achieved by previous counterfactual methods.
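The abstract does not spell out the verification procedure itself. As a rough, hypothetical illustration of the underlying intuition only (not the procedure proposed in the paper), the sketch below cross-checks a candidate perturbation against an independently trained model `g`: a perturbation that genuinely shifts the underlying class probabilities should move both models toward the target class, whereas an adversarial perturbation tailored to the deployed classifier `f` typically will not transfer to `g`. All names here (`f`, `g`, `verify_perturbation`, `margin`) are illustrative assumptions.

```python
import numpy as np

def verify_perturbation(f, g, x, x_prime, target_class, margin=0.0):
    """Hypothetical cross-check: flag a perturbation as trustworthy only if
    an independently trained model g agrees with the deployed classifier f
    that the target-class probability increased. A perturbation that fools
    only f (adversarial) is unlikely to move g in the same direction.

    f, g : callables mapping an input vector to a class-probability vector.
    """
    shift_f = f(x_prime)[target_class] - f(x)[target_class]
    shift_g = g(x_prime)[target_class] - g(x)[target_class]
    return shift_f > margin and shift_g > margin

# Toy usage with linear softmax models standing in for f and g.
def make_softmax_model(W, b):
    def model(x):
        z = W @ x + b
        e = np.exp(z - z.max())  # stabilized softmax
        return e / e.sum()
    return model

rng = np.random.default_rng(0)
f = make_softmax_model(rng.normal(size=(2, 4)), np.zeros(2))
g = make_softmax_model(rng.normal(size=(2, 4)), np.zeros(2))
x = rng.normal(size=4)
x_prime = x + 0.1 * rng.normal(size=4)
print(verify_perturbation(f, g, x, x_prime, target_class=1))
```

The design choice being illustrated is agreement between independently trained models as a proxy for a change in the true class probabilities; the paper's actual verification procedure and its PAC-learnability guarantees are developed formally in the body of the work.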