Polysemantic neurons -- neurons that activate for a set of unrelated features -- have been seen as a significant obstacle to the interpretability of task-optimized deep networks, with implications for AI safety. The classic origin story of polysemanticity is that the data contains more ``features'' than the network has neurons, such that learning to perform a task forces the network to co-allocate multiple unrelated features to the same neuron, endangering our ability to understand the network's internal processing. In this work, we present a second, non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, a phenomenon we term \textit{incidental polysemanticity}. Using a combination of theory and experiments, we show that incidental polysemanticity can arise for multiple reasons, including regularization and neural noise. It occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Our paper concludes by calling for further research quantifying the performance-polysemanticity tradeoff in task-optimized deep neural networks, to better understand to what extent polysemanticity is avoidable.
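The role of chance at initialization can be illustrated with a small simulation. The sketch below is not the paper's model or code. It assumes, purely for illustration, that each feature is initially "assigned" to the neuron on which its random initial weight has the largest magnitude, and it counts how many feature pairs land on the same neuron. The names `count_collisions`, `mean_collisions`, and `expected_collisions` are hypothetical. It covers only the chance overlap at initialization, not the training dynamics that would later strengthen it.

```python
import numpy as np

def count_collisions(n_features, n_neurons, rng):
    """Count feature pairs whose strongest initial weight lands on the same neuron."""
    # Hypothetical setup: W[i, k] is the initial weight from feature i to neuron k.
    W = rng.standard_normal((n_features, n_neurons))
    # Assumption: a feature is "assigned" to the neuron with its largest |weight|.
    winners = np.abs(W).argmax(axis=1)
    # A neuron holding c features contributes c * (c - 1) / 2 colliding pairs.
    counts = np.bincount(winners, minlength=n_neurons)
    return int((counts * (counts - 1) // 2).sum())

def mean_collisions(n_features, n_neurons, trials=200, seed=0):
    """Average the collision count over several independent random initializations."""
    rng = np.random.default_rng(seed)
    return float(np.mean([count_collisions(n_features, n_neurons, rng)
                          for _ in range(trials)]))

def expected_collisions(n_features, n_neurons):
    """Birthday-problem expectation for uniform, independent assignments."""
    return n_features * (n_features - 1) / (2 * n_neurons)

if __name__ == "__main__":
    n, m = 32, 1024  # far more neurons than features
    print("simulated:", mean_collisions(n, m))
    print("expected: ", expected_collisions(n, m))
```

Under this toy assumption, the expected number of colliding pairs is n(n-1)/(2m), which stays above zero however many neurons are available. That is the sense in which overlap can appear "by chance alone" even when neurons are plentiful.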