To improve trust and transparency, it is crucial to be able to interpret the decisions of deep neural network (DNN) classifiers. Instance-level examinations, such as attribution techniques, are commonly employed to interpret model decisions. However, when interpreting misclassified decisions, human intervention may be required. Analyzing the attributions across each class within a single instance can be particularly labor-intensive and influenced by the bias of the human interpreter. In this paper, we present a novel framework to uncover the weaknesses of a classifier via counterfactual examples. A prober is introduced to learn the correctness of the classifier's decision in terms of a binary code: hit or miss. This enables the creation of counterfactual examples with respect to the prober's decision. We evaluate the prober's misclassification-detection performance and verify its effectiveness on image classification benchmark datasets. Furthermore, by generating counterfactuals that penetrate the prober, we demonstrate that our framework effectively identifies vulnerabilities in the target classifier on the MNIST dataset without relying on label information.
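To make the two components concrete, the following is a minimal sketch, not the paper's exact method: a prober trained on the frozen target classifier's logits to predict hit/miss, and a gradient-based counterfactual search that perturbs an input until the prober's decision flips. All module names, the choice of logits as the prober's input, and the hyperparameters are illustrative assumptions.

```python
# Hedged sketch of a hit/miss prober and a prober-penetrating counterfactual
# search; architecture and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Prober(nn.Module):
    """Binary hit/miss head on top of the target classifier's logits."""
    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        return self.net(logits).squeeze(-1)  # raw score; > 0 means "hit"


def train_prober(classifier, prober, loader, device, epochs=5):
    """Fit the prober to predict correctness of the frozen classifier."""
    classifier.eval()
    opt = torch.optim.Adam(prober.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                logits = classifier(x)
                hit = (logits.argmax(dim=1) == y).float()  # 1 = hit, 0 = miss
            loss = F.binary_cross_entropy_with_logits(prober(logits), hit)
            opt.zero_grad()
            loss.backward()
            opt.step()


def counterfactual(classifier, prober, x, steps=100, lr=0.05, lam=1.0):
    """Perturb x (no label needed) so the prober's decision flips to 'miss';
    an L2 penalty keeps the counterfactual close to the original input."""
    classifier.eval()
    prober.eval()
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_cf = (x + delta).clamp(0, 1)
        score = prober(classifier(x_cf))                   # prober's hit score
        loss = score.mean() + lam * delta.pow(2).mean()    # push toward "miss"
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach().clamp(0, 1)
```

Under these assumptions, the counterfactual search needs no ground-truth labels: it only queries the prober's hit score, which is what allows the framework to surface classifier weaknesses without label information.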