Deep neural networks (DNNs) are expected to provide explanations that help users understand their black-box predictions. A saliency map is a common form of explanation that visualizes feature attributions as a heatmap, but it often suffers from noise that makes important features hard to distinguish. In this paper, we propose a model-agnostic learning method called Saliency Constrained Adaptive Adversarial Training (SCAAT) to improve the quality of such explanations. By constructing adversarial samples under the guidance of saliency maps, SCAAT effectively eliminates most of the noise and makes saliency maps sparser and more faithful without any modification to the model architecture. We apply SCAAT to multiple DNNs and evaluate the quality of the generated saliency maps on various natural and pathological image datasets. Evaluations across different domains and metrics show that SCAAT significantly improves the interpretability of DNNs by providing more faithful saliency maps without sacrificing predictive power.
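To make the idea of saliency-guided adversarial samples concrete, the following is a minimal sketch and not the SCAAT algorithm itself, whose details are not given in this abstract. It assumes a toy two-layer network, a gradient-magnitude saliency map, and an FGSM-style perturbation restricted to the least salient input features; the names `logit_and_grad` and `saliency_constrained_perturbation`, and the parameters `eps` and `keep_frac`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def logit_and_grad(x, W1, w2):
    """Return the scalar logit w2 . tanh(W1 x) and its gradient w.r.t. x."""
    h = np.tanh(W1 @ x)
    logit = float(w2 @ h)
    # Chain rule through tanh: d tanh(u)/du = 1 - tanh(u)^2
    grad = W1.T @ (w2 * (1.0 - h ** 2))
    return logit, grad


def saliency_constrained_perturbation(x, grad_loss, sal, eps=0.1, keep_frac=0.25):
    """FGSM-style step applied only to low-saliency features (an assumption).

    The `keep_frac` most salient features are left untouched; all others
    are shifted by `eps` in the direction that increases the loss.
    """
    k = max(1, int(round(keep_frac * x.size)))
    top = np.argsort(np.abs(sal))[-k:]  # indices of the k most salient features
    mask = np.ones_like(x)
    mask[top] = 0.0                     # protect the salient features
    x_adv = x + eps * mask * np.sign(grad_loss)
    return x_adv, mask


# Toy data: 16 input features, 8 hidden units, binary label y = 1.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=8)
x = rng.normal(size=16)
y = 1.0

logit, grad_logit = logit_and_grad(x, W1, w2)
p = 1.0 / (1.0 + np.exp(-logit))
grad_loss = (p - y) * grad_logit        # gradient of binary cross-entropy w.r.t. x
sal = np.abs(grad_logit)                # vanilla gradient saliency (an assumption)

x_adv, mask = saliency_constrained_perturbation(x, grad_loss, sal)
print("features perturbed:", int(mask.sum()), "of", x.size)
```

In an adversarial training loop, samples such as `x_adv` would be fed back to the model alongside the clean inputs. How SCAAT actually selects features, scales the perturbation, and adapts it during training is not specified here, so those choices above should be read as placeholders.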