I introduce a unified framework for finding a closed-form interpretation of any single neuron in an artificial neural network. Using this framework I demonstrate how to interpret neural network classifiers to reveal closed-form expressions of the concepts encoded in their decision boundaries. In contrast to neural network-based regression, for classification, it is in general impossible to express the neural network in the form of a symbolic equation even if the neural network itself bases its classification on a quantity that can be written as a closed-form equation. The interpretation framework is based on embedding trained neural networks into an equivalence class of functions that encode the same concept. I interpret these neural networks by finding an intersection between the equivalence class and human-readable equations defined by a symbolic search space. The approach is not limited to classifiers or full neural networks and can be applied to arbitrary neurons in hidden layers or latent spaces.
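The idea of an equivalence class of functions that encode the same concept can be illustrated with a toy example. The sketch below is not the framework itself: it assumes that two functions encode the same concept when they share level sets, and it tests this by checking whether their gradients are parallel. The abstract does not state this membership criterion, so it is an assumption made here. The names `concept`, `neuron`, and `gradient_alignment`, and the choice of g(x) = x1^2 + x2^2, are hypothetical.

```python
import numpy as np

def concept(x):
    """Hypothetical closed-form quantity g(x) = x1^2 + x2^2."""
    return x[..., 0] ** 2 + x[..., 1] ** 2

def neuron(x):
    """Stand-in for a trained output neuron: sigmoid(3 * (g(x) - 1)).

    Its output is not equal to g(x), but it is a monotonic function of g(x),
    so both share the same level sets (assumed notion of 'same concept').
    """
    return 1.0 / (1.0 + np.exp(-3.0 * (concept(x) - 1.0)))

def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of f at each row of x."""
    grad = np.zeros_like(x)
    for i in range(x.shape[-1]):
        step = np.zeros(x.shape[-1])
        step[i] = eps
        grad[..., i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def gradient_alignment(f, g, x):
    """Absolute cosine similarity between grad f and grad g at each point."""
    gf = numerical_gradient(f, x)
    gg = numerical_gradient(g, x)
    dot = np.sum(gf * gg, axis=-1)
    norms = np.linalg.norm(gf, axis=-1) * np.linalg.norm(gg, axis=-1)
    return np.abs(dot) / norms

rng = np.random.default_rng(0)
points = rng.uniform(0.2, 1.2, size=(200, 2))

# Candidate 1: the quantity the neuron is actually built on.
same_concept = gradient_alignment(neuron, concept, points)
# Candidate 2: a different expression, x1 + x2, with different level sets.
other_concept = gradient_alignment(neuron, lambda x: x[..., 0] + x[..., 1], points)

print("min alignment with x1^2 + x2^2:", same_concept.min())
print("min alignment with x1 + x2:    ", other_concept.min())
```

Under these assumptions, a symbolic search would score candidate expressions by a criterion of this kind, rather than by how closely they reproduce the neuron's output values. Such a criterion is insensitive to the sigmoid and to any other invertible transformation applied to the underlying quantity.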