Despite the excellent performance of deep neural networks (DNNs) in image classification, detection, and prediction, characterizing how a DNN arrives at a given decision remains an open problem, which has motivated a number of interpretability methods. Post-hoc interpretability methods primarily aim to quantify the importance of input features with respect to the class probabilities. However, because ground truth is lacking and existing interpretability methods have diverse operating characteristics, evaluating these methods remains a crucial challenge. A popular approach to evaluating interpretability methods is to perturb the input features deemed important for a given prediction and observe the resulting decrease in accuracy. However, the perturbation itself may introduce artifacts. We propose a method for estimating the impact of such artifacts on fidelity estimation by utilizing the model accuracy curves obtained by perturbing input features in Most Important First (MIF) and Least Important First (LIF) order. Using ResNet-50 trained on ImageNet, we demonstrate the proposed fidelity estimation for four popular post-hoc interpretability methods.
翻译:尽管深度神经网络在图像分类、检测与预测任务中展现出卓越性能,但解释其决策机制仍是一个悬而未决的问题,由此催生了多种可解释性方法。事后可解释性方法主要旨在量化输入特征对类别概率的重要性。然而,由于缺乏真值基准且不同可解释性方法具有迥异的运行特性,对这些方法的评估成为关键挑战。一种通用的评估范式是扰动被判定为对特定预测重要的输入特征,并观测模型准确率的衰减程度。但扰动本身可能引入伪影。我们提出一种新方法,通过依据"最重要优先"和"最不重要优先"顺序扰动输入特征获取模型准确率曲线,来估算此类伪影对保真度评估的影响。基于在ImageNet上训练的ResNet-50模型,我们验证了四种主流事后可解释性方法的保真度评估方案。
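To make the MIF and LIF perturbation orders concrete, the following is a minimal sketch of how the two accuracy curves can be computed from per-feature attribution scores. It rests on several assumptions that are not taken from the abstract. Features are treated as a flat list instead of image pixels or patches. "Perturbation" means replacing a feature with a constant `baseline` value, which is one common choice and also one possible source of the artifacts discussed above. The helpers `perturb`, `accuracy_curve`, and `area_under_curve` and the linear `toy_model` are hypothetical illustrations. The sketch does not reproduce the proposed artifact-impact estimator, which the abstract does not specify. It only produces the MIF and LIF curves such an estimator would consume.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def perturb(x: Vector, attribution: Vector, fraction: float,
            order: str, baseline: float = 0.0) -> Vector:
    """Replace round(fraction * n) features of x with `baseline`.

    order="MIF": features with the highest attribution are replaced first.
    order="LIF": features with the lowest attribution are replaced first.
    """
    n = len(x)
    k = round(fraction * n)
    # Rank indices by attribution: descending for MIF, ascending for LIF.
    ranked = sorted(range(n), key=lambda i: attribution[i], reverse=(order == "MIF"))
    removed = set(ranked[:k])
    return [baseline if i in removed else v for i, v in enumerate(x)]


def accuracy_curve(model: Callable[[Vector], int], inputs: Sequence[Vector],
                   labels: Sequence[int], attributions: Sequence[Vector],
                   order: str, fractions: Sequence[float],
                   baseline: float = 0.0) -> List[float]:
    """Model accuracy after perturbing each fraction of features in the given order."""
    curve = []
    for f in fractions:
        correct = sum(
            model(perturb(x, a, f, order, baseline)) == y
            for x, a, y in zip(inputs, attributions, labels)
        )
        curve.append(correct / len(inputs))
    return curve


def area_under_curve(fractions: Sequence[float], curve: Sequence[float]) -> float:
    """Trapezoidal area under an accuracy curve (one common scalar summary)."""
    return sum(
        (fractions[i + 1] - fractions[i]) * (curve[i] + curve[i + 1]) / 2
        for i in range(len(fractions) - 1)
    )


# Hypothetical toy setup: a linear binary classifier with weights w and bias b,
# and gradient-times-input attributions (w_i * x_i) for class 1.
w = [3.0, 2.0, 1.0, 0.5]
b = -2.5


def toy_model(x: Vector) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


inputs = [[1.0, 1.0, 1.0, 1.0]]
labels = [1]
attributions = [[wi * xi for wi, xi in zip(w, inputs[0])]]
fractions = [0.0, 0.25, 0.5, 0.75, 1.0]

mif = accuracy_curve(toy_model, inputs, labels, attributions, "MIF", fractions)
lif = accuracy_curve(toy_model, inputs, labels, attributions, "LIF", fractions)
print(mif)  # [1.0, 1.0, 0.0, 0.0, 0.0]
print(lif)  # [1.0, 1.0, 1.0, 1.0, 0.0]
print(area_under_curve(fractions, mif), area_under_curve(fractions, lif))  # 0.375 0.875
```

For a faithful attribution, accuracy should drop quickly under MIF and slowly under LIF, so the gap between the two curves is informative. Under LIF, however, accuracy can still fall simply because many features have been replaced by an out-of-distribution baseline. This is the kind of artifact the proposed method aims to account for. In this toy example the two orders diverge because the attributions exactly match the model's weights.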