Feature attribution is the dominant paradigm for explaining the predictions of complex machine learning models such as neural networks. However, most existing methods offer little guarantee that their explanations reflect the model's prediction-making process. We define the notion of explanatory alignment and argue that it is central to trustworthy predictive modeling: in short, it requires that explanations directly underlie predictions rather than serve as post-hoc rationalizations. We present model readability as a design principle that enables alignment, and Pointwise-interpretable Networks (PiNets) as a modeling framework for pursuing it in a deep learning context. PiNets combine statistical intelligence with a pseudo-linear structure that yields instance-wise linear predictions in an arbitrary feature space. We illustrate their use on image classification and segmentation tasks, demonstrating that PiNets produce explanations that are not only aligned by design but also faithful along other dimensions: meaningfulness, robustness, and sufficiency.
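The abstract does not spell out the pseudo-linear structure, but one plausible reading is that a coefficient network produces instance-wise weights w(x) and the prediction is the inner product ⟨w(x), φ(x)⟩, making each prediction exactly linear in the feature space φ. The sketch below illustrates that reading only; `PseudoLinearHead`, `coef_net`, and the layer sizes are hypothetical names and choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class PseudoLinearHead(nn.Module):
    """Hypothetical sketch of an instance-wise linear (pseudo-linear) head.

    A coefficient network g maps features phi(x) to per-instance weight
    vectors w(x); the prediction is the inner product <w(x), phi(x)>, so
    every prediction is exactly linear in the chosen feature space and the
    weights themselves serve as the aligned-by-design explanation.
    """

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.num_classes = num_classes
        # g is an unconstrained network (the "statistical intelligence");
        # it emits one weight vector per class for each instance.
        self.coef_net = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, feature_dim * num_classes),
        )

    def forward(self, phi: torch.Tensor):
        # phi: (batch, feature_dim) features of the input in a chosen space.
        w = self.coef_net(phi).view(-1, self.num_classes, phi.shape[-1])
        # Logits are linear in phi with instance-wise coefficients w, so w
        # directly underlies the prediction rather than rationalizing it.
        logits = torch.einsum("bcf,bf->bc", w, phi)
        return logits, w

# Example usage: 64-dimensional features, 10 classes.
head = PseudoLinearHead(feature_dim=64, num_classes=10)
logits, weights = head(torch.randn(8, 64))
assert logits.shape == (8, 10) and weights.shape == (8, 10, 64)
```

Under this reading, "aligned by design" falls out of the architecture: the returned `weights` are not an approximation of the model's behavior but the exact coefficients that produced the logits for that instance.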