Deep neural networks are often considered opaque systems, prompting the need for explainability methods that improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., the pixels of an image) or to influential training examples. We argue that these two perspectives should be studied jointly. This work explores *training feature attribution*, which links a test prediction to specific regions of specific training images and thereby offers new insight into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful training examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
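The abstract does not specify the underlying estimator, so the following is only a minimal sketch of one plausible instantiation, not necessarily the paper's method: a TracIn-style influence score (the dot product of the test-loss and training-loss parameter gradients) is differentiated with respect to the training image's pixels to localize which regions carry that influence. The function name and interface here are hypothetical.

```python
import torch
import torch.nn.functional as F

def training_feature_attribution(model, x_train, y_train, x_test, y_test):
    """Hypothetical sketch: attribute a test prediction to pixels of one
    training image by differentiating a TracIn-style influence score
    (dot product of per-example loss gradients) w.r.t. training pixels."""
    model.eval()

    # Gradient of the test loss w.r.t. parameters (treated as constants below).
    test_loss = F.cross_entropy(model(x_test), y_test)
    g_test = torch.autograd.grad(test_loss, model.parameters())

    # Gradient of the training loss w.r.t. parameters, keeping the graph so
    # the influence score can itself be differentiated w.r.t. the pixels.
    x_train = x_train.clone().requires_grad_(True)
    train_loss = F.cross_entropy(model(x_train), y_train)
    g_train = torch.autograd.grad(train_loss, model.parameters(),
                                  create_graph=True)

    # TracIn-style influence: <grad of test loss, grad of training loss>.
    influence = sum((gt.detach() * gr).sum()
                    for gt, gr in zip(g_test, g_train))

    # Pixel-level map: d(influence)/d(training pixels), summed over channels.
    pixel_attr = torch.autograd.grad(influence, x_train)[0]
    return influence.detach(), pixel_attr.abs().sum(dim=1)  # scalar, (N, H, W)
```

Under this reading, a strongly positive influence score paired with a pixel map concentrated on, say, a fixed corner patch of the training image would flag exactly the kind of patch-based shortcut the abstract describes.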