Despite the excellent performance of deep neural networks (DNNs) in image classification, detection, and prediction, characterizing how DNNs arrive at a given decision remains an open problem, which has motivated a number of interpretability methods. Post-hoc interpretability methods primarily aim to quantify the importance of input features with respect to the class probabilities. However, because ground truth is lacking and the available interpretability methods have diverse operating characteristics, evaluating these methods is a crucial challenge. A popular approach to evaluating interpretability methods is to perturb the input features deemed important for a given prediction and observe the decrease in accuracy. However, perturbation itself may introduce artifacts, since perturbed images may be out-of-distribution (OOD). In this paper, we conduct computational experiments to estimate the contribution of perturbation artifacts and develop a method to estimate the fidelity of interpretability methods. We demonstrate that, while perturbation artifacts do exist, their impact on fidelity estimation can be minimized and characterized by using model accuracy curves obtained from perturbing input features in Most Important First (MIF) and Least Important First (LIF) order. Using a ResNet-50 trained on ImageNet, we demonstrate the proposed fidelity estimation for four popular post-hoc interpretability methods.
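The MIF/LIF perturbation protocol described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: a toy linear classifier stands in for ResNet-50, random vectors stand in for images, `|w * x|` stands in for an attribution map, features are perturbed by zero-filling, and the function name `perturbation_curve` is hypothetical. The final `gap` value is only a simple summary of how far the two curves separate; it is not the fidelity estimator proposed in the paper, which the abstract does not specify.

```python
import numpy as np


def perturbation_curve(predict, X, y, importance, order="MIF", steps=10, fill=0.0):
    """Accuracy as a growing fraction of features is replaced by `fill`.

    order="MIF" perturbs the most important features first,
    order="LIF" perturbs the least important features first.
    """
    n, d = X.shape
    # Per-sample feature ranking: descending importance for MIF, ascending for LIF.
    key = -importance if order == "MIF" else importance
    ranks = np.argsort(key, axis=1)
    rows = np.arange(n)[:, None]
    fractions = np.linspace(0.0, 1.0, steps + 1)
    accuracies = []
    for frac in fractions:
        k = int(round(frac * d))           # number of features to perturb
        X_pert = X.copy()
        X_pert[rows, ranks[:, :k]] = fill  # zero-fill is an assumed perturbation
        accuracies.append(float(np.mean(predict(X_pert) == y)))
    return fractions, np.array(accuracies)


# Toy stand-ins (assumptions): a linear model and random inputs.
rng = np.random.default_rng(0)
w = rng.normal(size=20)
X = rng.normal(size=(500, 20))


def predict(inputs):
    return (inputs @ w > 0).astype(int)


y = predict(X)               # labels taken from the model, so clean accuracy is 1.0
importance = np.abs(X * w)   # assumed attribution: magnitude of each term w_i * x_i

fractions, acc_mif = perturbation_curve(predict, X, y, importance, order="MIF")
_, acc_lif = perturbation_curve(predict, X, y, importance, order="LIF")

# Illustrative summary only: mean separation between the LIF and MIF curves.
gap = float(np.mean(acc_lif - acc_mif))
print("MIF accuracy:", np.round(acc_mif, 3))
print("LIF accuracy:", np.round(acc_lif, 3))
print("mean LIF - MIF gap:", round(gap, 3))
```

Both orders apply the same perturbation to the same number of features at each step, so any out-of-distribution artifact affects both curves. This is why comparing the two curves, rather than reading the MIF curve alone, helps separate the effect of the importance ranking from the effect of perturbation itself.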