As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner is a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods that assign each input feature a score corresponding to its influence on a model's output. A major limitation of this family of explainers in practice is that they can disagree on which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We do this by introducing a Post hoc Explainer Agreement Regularization (PEAR) loss term, an additional term that sits alongside the standard accuracy term and measures the difference in feature attribution between a pair of explainers. We observe on three datasets that a model trained with this loss term shows improved explanation consensus on unseen data, including improved consensus between explainers other than those used in the loss term. We also examine the trade-off between improved consensus and model performance. Finally, we study the influence our method has on feature attribution explanations.
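To make the shape of such an objective concrete, the following is a minimal sketch of a loss that adds an explainer-disagreement penalty to a task loss. It is an illustration under stated assumptions, not the paper's implementation: the abstract says only that the extra term measures the difference in feature attribution between a pair of explainers, so the use of one minus Pearson correlation as that measure, the additive weighting by `lam`, and the function names `pearson`, `disagreement`, and `combined_loss` are all hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length attribution vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    # A constant vector has no defined correlation; treat it as 0 here.
    return cov / denom if denom > 0 else 0.0


def disagreement(attr_a, attr_b):
    """Assumed disagreement measure: 0 when the two explainers agree
    perfectly (correlation 1), 2 when they are perfectly opposed."""
    return 1.0 - pearson(attr_a, attr_b)


def combined_loss(task_loss, attr_a, attr_b, lam=0.5):
    """Task loss plus a weighted disagreement penalty.

    `lam` is a hypothetical weight controlling the trade-off between
    task performance and explainer consensus.
    """
    return task_loss + lam * disagreement(attr_a, attr_b)


# Example: attributions from two hypothetical explainers for one input.
explainer_a = [0.9, 0.1, -0.4, 0.0]
explainer_b = [0.7, 0.3, -0.5, 0.1]
loss = combined_loss(task_loss=0.3, attr_a=explainer_a, attr_b=explainer_b)
```

In actual training, the attributions would have to be computed in a differentiable way so that the penalty can influence the model's weights; this sketch shows only how the two terms combine once attribution vectors are available.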