Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make neural networks harder to interpret, so we aim to understand its causes. We propose doing so through the lens of feature \emph{capacity}, the fractional dimension each feature consumes in the embedding space. We show that in a toy model the optimal capacity allocation tends to represent the most important features monosemantically, represent less important features polysemantically (in proportion to their impact on the loss), and ignore the least important features entirely. Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity, and it is more prevalent in some architectures than in others. Given an optimal allocation of capacity, we go on to study the geometry of the embedding space. We find a block-semi-orthogonal structure, with block sizes that differ across models, highlighting the impact of model architecture on the interpretability of its neurons.
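
To make the notion of a "fractional dimension" concrete, here is a minimal sketch of one way capacity could be computed. The abstract does not state a formula, so the definition used here is an assumption: the capacity of feature $i$ with embedding vector $W_i$ is taken to be $C_i = (W_i \cdot W_i)^2 / \sum_j (W_i \cdot W_j)^2$. The function name `capacities` and the example vectors are hypothetical and chosen only for illustration.

```python
def capacities(W):
    """Return one capacity value per feature embedding vector in W.

    Assumed definition (not taken from the abstract):
        C_i = (W_i . W_i)^2 / sum_j (W_i . W_j)^2
    A feature with a zero embedding vector is assigned capacity 0.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    caps = []
    for wi in W:
        self_term = dot(wi, wi) ** 2
        total = sum(dot(wi, wj) ** 2 for wj in W)
        caps.append(self_term / total if total > 0 else 0.0)
    return caps


# Hypothetical example: four features embedded in a 2-dimensional space.
features = [
    [1.0, 0.0],  # has a direction to itself (monosemantic)
    [0.0, 1.0],  # shares a direction with the next feature (polysemantic)
    [0.0, 1.0],
    [0.0, 0.0],  # not represented at all (ignored)
]
caps = capacities(features)
print(caps)       # [1.0, 0.5, 0.5, 0.0]
print(sum(caps))  # 2.0
```

Under this assumed definition, each capacity lies between 0 and 1, and in this example the capacities sum to 2.0, the embedding dimension. The three regimes in the abstract correspond to capacity 1 (monosemantic), a fraction between 0 and 1 (polysemantic), and 0 (ignored).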