Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features. However, many neurons exhibit $\textit{mixed selectivity}$, i.e., they represent multiple unrelated features. A recent hypothesis proposes that features in deep networks may be represented in $\textit{superposition}$, i.e., on non-orthogonal axes by multiple neurons, since the number of possible interpretable features in natural data is generally larger than the number of neurons in a given network. Accordingly, we should be able to find meaningful directions in activation space that are not aligned with individual neurons. Here, we propose (1) an automated method for quantifying visual interpretability that is validated against a large database of human psychophysics judgments of neuron interpretability, and (2) an approach for finding meaningful directions in network activation space. We leverage these methods to discover directions in convolutional neural networks that are more intuitively meaningful than individual neurons, as we confirm and investigate in a series of analyses. Moreover, we apply the same method to three recent datasets of visual neural responses in the brain and find that our conclusions largely transfer to real neural data, suggesting that superposition might be deployed by the brain. This also provides a link with disentanglement and raises fundamental questions about robust, efficient and factorized representations in both artificial and biological neural systems.
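To make the superposition hypothesis concrete, the following is a minimal toy sketch and not the method proposed in this work. It assumes three sparse, one-hot features embedded in a two-neuron activation space along non-orthogonal unit directions; the names `directions`, `features`, `activations`, and `readout`, as well as the 45-degree offset, are illustrative choices. Under these assumptions each neuron responds positively to two unrelated features (mixed selectivity), whereas a rectified projection onto each feature direction recovers the individual features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features, n_samples = 2, 3, 1000

# Assumed toy setup: three unit-norm feature directions, 120 degrees apart,
# rotated by 45 degrees so that none is aligned with a neuron axis.
angles = np.pi / 4 + 2 * np.pi * np.arange(n_features) / n_features
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (features, neurons)

# Sparse features: exactly one feature is active in each sample.
active = rng.integers(0, n_features, size=n_samples)
features = np.zeros((n_samples, n_features))
features[np.arange(n_samples), active] = 1.0

# Neuron activations are a linear superposition of the feature directions.
activations = features @ directions  # (samples, neurons)

# Mixed selectivity: count the features that drive each neuron positively.
for neuron in range(n_neurons):
    driving = np.flatnonzero(directions[:, neuron] > 0)
    print(f"neuron {neuron} responds positively to features {driving.tolist()}")

# Reading out along the feature directions, followed by rectification,
# separates the features because the interference terms are negative.
readout = np.maximum(activations @ directions.T, 0.0)
print("features recovered:", np.allclose(readout, features))
```

The recovery is exact here only because the features are one-hot and the pairwise interference between directions is negative; with denser or noisier features the readout would be approximate.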