Counterfactual explanations (CEs) enhance the interpretability of machine learning models by describing which changes to an input are needed to change its prediction to a desired class. These explanations are commonly used to guide users' actions, e.g., by describing how a user whose loan application was denied can be approved for a loan in the future. Existing approaches generate CEs for a single, fixed model and provide no formal guarantees on the CEs' future validity. When models are updated periodically to account for data shift, CEs that are not robust to those updates may stop being valid, so users' actions may no longer have the desired effect on their predictions. This paper introduces VeriTraCER, an approach that jointly trains a classifier and an explainer to explicitly account for the robustness of the generated CEs to small model shifts. VeriTraCER optimizes a carefully designed loss function that ensures the verifiable robustness of CEs to local model updates, thus providing deterministic guarantees on CE validity. Our empirical evaluation demonstrates that VeriTraCER generates CEs that (1) are verifiably robust to small model updates and (2) are competitive in robustness with state-of-the-art approaches under empirical model updates, including random initialization, leave-one-out, and distribution shifts.
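To make the notion of a CE being "verifiably robust to small model updates" concrete, the following is a minimal sketch under strong simplifying assumptions; it is not VeriTraCER's training procedure or its verifier. It assumes a linear binary classifier with logit w·x + b, and it models a "small model update" as any parameter change bounded by a hypothetical budget `delta` in the infinity norm. Under those assumptions the worst-case logit has a closed form, so validity of a CE can be checked deterministically. The function names and numbers are illustrative only.

```python
import numpy as np

def worst_case_logit(w, b, x, delta):
    """Smallest logit over all updated models (w', b') with
    ||w' - w||_inf <= delta and |b' - b| <= delta.

    For a linear model, the adversarial update shifts each weight by
    delta against the sign of x_i and lowers the bias by delta, giving
    w.x + b - delta * (||x||_1 + 1).
    """
    return float(w @ x + b - delta * (np.abs(x).sum() + 1.0))

def is_verifiably_robust(w, b, x_cf, delta):
    """True if x_cf stays in the positive class for every model
    inside the assumed delta-ball around (w, b)."""
    return worst_case_logit(w, b, x_cf, delta) > 0.0

# Illustrative (made-up) model and counterfactual.
w = np.array([1.0, -2.0])
b = 0.5
x_cf = np.array([3.0, 0.5])

print(is_verifiably_robust(w, b, x_cf, delta=0.1))  # small update budget
print(is_verifiably_robust(w, b, x_cf, delta=1.0))  # large update budget
```

For nonlinear models such as neural networks no such closed form exists in general, and a sound bound on the worst-case output would have to come from a verification technique instead; the linear case is used here only because the guarantee can be checked by hand.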