This paper illustrates the emergence of concepts in a trained deep neural network (DNN). Specifically, we find that the inference score of a DNN can be disentangled into the effects of a few interactive concepts. These concepts can be understood as causal patterns in a sparse, symbolic causal graph that explains the DNN. The faithfulness of this explanation is theoretically guaranteed: we prove that the causal graph closely mimics the DNN's outputs on an exponential number of different masked samples. Moreover, the causal graph can be further simplified and rewritten as an And-Or graph (AOG) without losing much explanation accuracy.
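To make the disentanglement claim concrete, the following minimal sketch decomposes the score of a toy function into interaction effects and checks that these effects reproduce the score on every masked sample. It rests on assumptions not stated in this abstract: interactions are measured with the Harsanyi dividend, masking replaces a variable with a baseline value of 0, and `toy_model`, `interactions`, and `reconstruct` are hypothetical names standing in for a real DNN and its explainer.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a tuple, smallest subsets first."""
    items = tuple(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)


def toy_model(x, kept, baseline=0.0):
    """Hypothetical stand-in for a DNN: score on a masked sample.

    Variables whose index is not in `kept` are replaced by `baseline`.
    """
    z = [xi if i in kept else baseline for i, xi in enumerate(x)]
    # Two AND-like patterns, on (0, 1) and (1, 2, 3), plus a single-variable term.
    return 2.0 * z[0] * z[1] - 1.5 * z[1] * z[2] * z[3] + 0.5 * z[2]


def interactions(x, model):
    """Return masked outputs v(S) and interaction effects I(S) for all S.

    Assumes the Harsanyi dividend: I(S) = sum_{T subseteq S} (-1)^(|S|-|T|) v(T).
    """
    n = len(x)
    v = {S: model(x, set(S)) for S in subsets(range(n))}
    effects = {
        S: sum((-1) ** (len(S) - len(T)) * v[T] for T in subsets(S))
        for S in subsets(range(n))
    }
    return v, effects


def reconstruct(effects, kept):
    """Score predicted by the graph: sum of effects of patterns inside `kept`."""
    kept = set(kept)
    return sum(e for S, e in effects.items() if set(S) <= kept)


x = [1.0, 2.0, 3.0, 1.0]
v, effects = interactions(x, toy_model)

# Sparsity: only a few of the 2^n candidate patterns carry a nonzero effect.
salient = {S: e for S, e in effects.items() if abs(e) > 1e-9}

# Faithfulness: the effects reproduce the output on all 2^n masked samples.
max_error = max(abs(reconstruct(effects, S) - v[S]) for S in v)
```

In this toy case only three of the sixteen candidate patterns have a nonzero effect, and the reconstruction error is zero on all sixteen masked samples. The enumeration is exponential in the number of input variables, so the sketch is practical only for small inputs.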