A major challenge in Explainable AI is correctly interpreting the activations of hidden neurons: accurate interpretations would provide insight into what a deep learning system has internally detected as relevant in the input, demystifying the otherwise black-box character of such systems. The state of the art indicates that hidden node activations can, in some cases, be interpreted in a way that makes sense to humans, but systematic automated methods that can hypothesize and verify interpretations of hidden neuron activations are underexplored. In this paper, we provide such a method and demonstrate that it yields meaningful interpretations. Our approach uses large-scale background knowledge (approximately 2 million classes curated from the Wikipedia concept hierarchy) together with a symbolic reasoning approach called Concept Induction, which is based on description logics and was originally developed for applications in the Semantic Web field. Our results show that we can automatically attach meaningful labels from the background knowledge to individual neurons in the dense layer of a Convolutional Neural Network through a hypothesis-and-verification process.