Feature attribution is the dominant paradigm for explaining deep neural networks. However, most existing methods only loosely reflect the model's prediction-making process, merely painting the black box white. We argue that explanatory alignment is a key aspect of trustworthiness in prediction tasks: explanations must be directly linked to predictions, rather than serving as post-hoc rationalizations. We present model readability as a design principle enabling alignment, and PiNets as a modeling framework to pursue it in a deep learning context. PiNets are pseudo-linear networks that produce instance-wise linear predictions in an arbitrary feature space, making them linearly readable. We illustrate their use on image classification and segmentation tasks, demonstrating how PiNets produce explanations that are faithful across multiple criteria in addition to alignment.
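For concreteness, one hedged reading of "instance-wise linear predictions" (the symbols $\phi$, $w$, and $b$ below are illustrative assumptions, not necessarily the paper's notation): given an input $x$ with representation $\phi(x)$ in the chosen feature space, a pseudo-linear network computes

\[
f(x) \;=\; w(x)^{\top}\,\phi(x) \;+\; b(x),
\]

where the coefficients $w(x)$ and $b(x)$ may depend on $x$ but enter the prediction linearly in $\phi(x)$. Reading off $w(x)$ for a given instance then yields an explanation that is, by construction, the same linear map that produced the prediction, which is what makes the model linearly readable and the explanation aligned with the prediction.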