With the popularity of deep neural networks (DNNs), model interpretability is becoming a critical concern. Many approaches have been developed to tackle the problem through post-hoc analysis, such as explaining how predictions are made or understanding the meaning of neurons in middle layers. Nevertheless, these methods can only discover the patterns or rules that naturally exist in models. In this work, rather than relying on post-hoc schemes, we proactively instill knowledge to alter the representation of human-understandable concepts in hidden layers. Specifically, we use a hierarchical tree of semantic concepts to store the knowledge, which is leveraged to regularize the representations of image data instances while training deep models. The axes of the latent space are aligned with the semantic concepts, where the hierarchical relations between concepts are also preserved. Experiments on real-world image datasets show that our method improves model interpretability, showing better disentanglement of semantic concepts, without negatively affecting model classification performance.
翻译:随着深度神经网络(DNNs)的普及,模型可解释性已成为关键问题。为应对这一挑战,现有方法多采用事后分析方案,例如解释预测结果的生成机制或揭示中间层神经元的语义含义。然而,这些方法仅能发现模型中天然存在的模式或规则。本研究摒弃事后解释范式,通过主动注入知识来改变隐层中人类可理解概念的表示方式。具体而言,我们采用语义概念层次树存储领域知识,在深度模型训练过程中利用该树状结构对图像数据实例的表示施加正则化约束。潜空间的坐标轴与语义概念对齐,同时保留概念间的层级关系。在真实图像数据集上的实验表明,本方法在提升模型可解释性的同时,实现了更好的语义概念解耦,且未对模型分类性能产生负面影响。