Faithfully summarizing the knowledge encoded by a deep neural network (DNN) as a few symbolic primitive patterns, without losing much information, is a core challenge in explainable AI. To this end, Ren et al. (2023c) derived a series of theorems proving that the inference score of a DNN can be explained by a small set of interactions between input variables. However, such interactions lack generalization power, so it remains hard to regard them as faithful primitive patterns encoded by the DNN. Therefore, given different DNNs trained for the same task, we develop a new method to extract the interactions that these DNNs share. Experiments show that the extracted interactions better reflect the common knowledge shared by different DNNs.
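To make the notion of "explaining an inference score as a small set of interactions" concrete, the following is a minimal sketch. It assumes the interactions are Harsanyi-style interactions, I(S) = Σ_{T⊆S} (-1)^{|S|-|T|} v(T), where v(T) is the model's score when only the variables in T are left unmasked; the abstract itself does not give this formula. The function `toy_score` is a hypothetical stand-in for a DNN, and the helper names are illustrative only. The sketch does not implement the proposed method for extracting interactions shared across DNNs.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a tuple, from the empty set upward."""
    items = tuple(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)


def harsanyi_interactions(v, n):
    """Return {S: I(S)} for every S within {0, ..., n-1}, where
    I(S) = sum over T subset of S of (-1)^(|S|-|T|) * v(T).
    Assumption: this is the interaction definition being illustrated."""
    return {
        S: sum((-1) ** (len(S) - len(T)) * v(frozenset(T)) for T in subsets(S))
        for S in subsets(range(n))
    }


def toy_score(present):
    """Hypothetical stand-in for a DNN's score on a masked input.
    `present` is the set of input variables that are NOT masked."""
    score = 0.5                 # output on the fully masked input
    if 0 in present:
        score += 2.0            # effect of variable 0 alone
    if {1, 2} <= present:
        score += 3.0            # AND pattern: needs variables 1 and 2 together
    return score


interactions = harsanyi_interactions(toy_score, 3)

# Keep only the interactions with a non-negligible effect.
salient = {S: w for S, w in interactions.items() if abs(w) > 1e-9}
print(salient)

# The interaction effects sum back to the score on the unmasked input.
print(sum(interactions.values()), toy_score(frozenset(range(3))))
```

In this toy case only three of the eight possible interactions are non-zero, and together they reproduce the full score exactly, which is the kind of sparse decomposition the abstract refers to.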