Despite substantial efforts, neural network interpretability remains an elusive goal, with previous research failing to provide succinct explanations of most single neurons' impact on the network output. This limitation is due to the polysemantic nature of most neurons, whereby a given neuron is involved in multiple unrelated network states, complicating the interpretation of that neuron. In this paper, we apply tools developed in neuroscience and information theory to propose a novel practical approach to network interpretability together with theoretical insights into polysemanticity and the density of neural codes. We infer levels of redundancy in the network's code by inspecting the eigenspectrum of the activations' covariance matrix. Furthermore, we show how random projections can reveal whether a network exhibits a smooth or non-differentiable code and hence how interpretable the code is. This same framework explains the advantages of polysemantic neurons for learning performance and accounts for trends found in the recent results of Elhage et al.~(2022). Our approach advances the pursuit of interpretability in neural networks, providing insights into their underlying structure and suggesting new avenues for circuit-level interpretability.
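As a minimal sketch of the covariance-based probe mentioned above (not the authors' implementation; the placeholder activation matrix and the 90% variance threshold are illustrative assumptions), one can estimate redundancy from the eigenspectrum of a layer's activation covariance:

import numpy as np

# Hypothetical (n_samples, n_neurons) matrix of activations collected from
# one layer of a trained network; replaced here by random placeholder data.
rng = np.random.default_rng(0)
activations = rng.standard_normal((1000, 128))

# Center the activations and form the neuron-by-neuron covariance matrix.
centered = activations - activations.mean(axis=0, keepdims=True)
cov = centered.T @ centered / (centered.shape[0] - 1)

# Eigenspectrum of the covariance: a few dominant eigenvalues followed by a
# rapid decay suggests a redundant, low-dimensional code; a flat spectrum
# suggests a dense code with little redundancy.
eigvals = np.linalg.eigvalsh(cov)[::-1]
explained = np.cumsum(eigvals) / eigvals.sum()
print("leading eigenvalues:", eigvals[:5])
print("dimensions for 90% variance:", int(np.searchsorted(explained, 0.9)) + 1)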