It has been demonstrated in many scientific fields that artificial neural networks such as autoencoders or Siamese networks encode meaningful concepts in their latent spaces. However, no comprehensive framework exists for retrieving this information in human-readable form without prior knowledge. To extract these concepts, we introduce a framework for finding closed-form interpretations of neurons in the latent spaces of artificial neural networks. The interpretation framework is based on embedding trained neural networks into an equivalence class of functions that encode the same concept. We interpret these neural networks by finding an intersection between this equivalence class and the set of human-readable equations defined by a symbolic search space. The approach is demonstrated by retrieving invariants of matrices and conserved quantities of dynamical systems from the latent spaces of Siamese neural networks.
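The core idea above can be illustrated with a minimal sketch. Assume (hypothetically) that a trained latent neuron has learned some unknown invertible transform of a matrix invariant; since any invertible reparametrisation of the neuron encodes the same concept, membership in the equivalence class can be tested with a measure that is blind to such reparametrisations, e.g. rank correlation against candidates from a small symbolic search space. The neuron, the candidate list, and the sample sizes here are illustrative assumptions, not the paper's actual architecture or search procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained latent neuron: here it happens to have
# learned an invertible (monotone) transform of the trace of its input matrix.
def latent_neuron(M):
    return np.tanh(0.5 * np.trace(M))

# Toy symbolic search space: candidate closed-form invariants of a matrix.
candidates = {
    "trace": lambda M: np.trace(M),
    "det": lambda M: np.linalg.det(M),
    "frobenius^2": lambda M: np.sum(M * M),
}

# Sample random 2x2 matrices and evaluate the neuron on each.
mats = [rng.normal(size=(2, 2)) for _ in range(500)]
z = np.array([latent_neuron(M) for M in mats])

def rank_corr(a, b):
    # Spearman rank correlation: invariant under monotone reparametrisation,
    # so it scores equivalence-class membership up to invertible scalar maps.
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Intersect the equivalence class with the symbolic search space:
# the best-scoring candidate is the closed-form interpretation.
scores = {name: abs(rank_corr(z, np.array([f(M) for M in mats])))
          for name, f in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # → trace 1.0
```

In this toy setting the trace attains perfect rank correlation with the neuron, while determinant and squared Frobenius norm do not, so the search recovers "trace" as the interpretation. A real application would replace the hand-written neuron with one taken from a trained network and the candidate dictionary with a genuine symbolic-regression search.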