Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle to the interpretability of task-optimized deep networks, with implications for AI safety. The classic origin story of polysemanticity is that the data contains more "features" than the network has neurons, so that learning to perform a task forces the network to co-allocate multiple unrelated features to the same neuron, endangering our ability to understand the network's internal processing. In this work, we present a second, non-mutually exclusive origin story of polysemanticity. Using a combination of theory and experiments, we show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data. This second type of polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Due to its origin, we term this \textit{incidental polysemanticity}.