Existing explanation methods for image classification struggle to provide explanations that are both faithful and plausible. This paper addresses this issue by proposing a post-hoc natural language explanation (NLE) method that can be applied to any CNN-based classifier without altering its training process or affecting predictive performance. By analysing influential neurons and the corresponding activation maps, the method generates a faithful description of the classifier's decision process in the form of a structured meaning representation, which is then converted into text by a language model. Through this pipeline approach, the generated explanations are grounded in the neural network architecture, providing accurate insight into the classification process while remaining accessible to non-experts. Experimental results show that the NLEs constructed by our method are significantly more plausible and faithful than those of existing methods. In particular, user interventions in the neural network structure (masking of neurons) are three times more effective than with the baselines.
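The masking intervention mentioned above can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the feature map and classification head are random stand-ins, and per-channel mean activation is used as a simple proxy for the paper's influence criterion. The sketch only shows the mechanics of zeroing out an "influential" channel and observing the effect on the logits.

```python
import random

# Hypothetical setup: a small conv feature map (C channels of H x W
# activations) and a linear classification head over pooled channels.
random.seed(0)
C, H, W, K = 8, 7, 7, 10
feature_map = [[[random.random() for _ in range(W)]
                for _ in range(H)] for _ in range(C)]
W_head = [[random.gauss(0, 1) for _ in range(C)] for _ in range(K)]

def pooled(fm):
    # Global average pooling: one scalar per channel.
    return [sum(sum(row) for row in ch) / (H * W) for ch in fm]

def logits(fm):
    # Linear head applied to the pooled channel activations.
    p = pooled(fm)
    return [sum(w * a for w, a in zip(row, p)) for row in W_head]

# Proxy influence score: mean activation per channel (a stand-in for
# the attribution criterion used in the paper).
influence = pooled(feature_map)
top = max(range(C), key=lambda c: influence[c])

# User intervention: mask (zero out) the most influential channel.
masked = [[[0.0] * W for _ in range(H)] if c == top else feature_map[c]
          for c in range(C)]

# The intervention measurably shifts the classifier's logits.
delta = max(abs(a - b) for a, b in zip(logits(feature_map), logits(masked)))
```

In the actual method, the influence score and the masked units would come from the analysis of the trained CNN rather than from this mean-activation proxy; the point is that masking is a forward-pass edit requiring no retraining.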