A basic question in the emerging field of mechanistic interpretability is the degree to which different neural networks learn the same underlying mechanisms. In other words, are neural mechanisms universal across models? In this work, we study the universality of individual neurons across GPT2 models trained from different random seeds, motivated by the hypothesis that universal neurons are likely to be interpretable. In particular, we compute pairwise correlations of neuron activations over 100 million tokens for every neuron pair across five different seeds and find that 1-5% of neurons are universal, that is, they belong to pairs of neurons that consistently activate on the same inputs. We then study these universal neurons in detail, finding that they usually have clear interpretations, and taxonomize them into a small number of neuron families. We conclude by studying patterns in neuron weights to establish several universal functional roles of neurons in simple circuits: deactivating attention heads, changing the entropy of the next-token distribution, and predicting that the next token is (or is not) within a particular set.
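The pairwise-correlation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes activations for two models have already been collected over the same tokens as arrays of shape `(n_tokens, n_neurons)`, it uses small synthetic data in place of real GPT2 activations, and the function names and the `threshold=0.5` cutoff are hypothetical choices made for illustration only.

```python
import numpy as np


def pairwise_neuron_correlation(acts_a, acts_b):
    """Pearson correlation between every neuron in model A and every neuron in model B.

    acts_a: (n_tokens, n_neurons_a), acts_b: (n_tokens, n_neurons_b),
    both recorded on the same token sequence.
    Returns an array of shape (n_neurons_a, n_neurons_b).
    """
    # Center each neuron's activations over tokens.
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    # Scale each column to unit norm; the epsilon guards against dead neurons.
    a = a / (np.linalg.norm(a, axis=0) + 1e-12)
    b = b / (np.linalg.norm(b, axis=0) + 1e-12)
    # Dot products of unit-norm, centered columns are Pearson correlations.
    return a.T @ b


def universal_mask(acts_a, acts_b, threshold=0.5):
    """Flag neurons in A whose best match in B exceeds `threshold`.

    The threshold value is an assumption for this sketch, not a value
    taken from the text above.
    """
    corr = pairwise_neuron_correlation(acts_a, acts_b)
    return corr.max(axis=1) > threshold


# Synthetic stand-in for two seeds: 3 shared features plus unrelated neurons.
rng = np.random.default_rng(0)
n_tokens = 5000
shared = rng.normal(size=(n_tokens, 3))

acts_a = np.hstack([
    shared + 0.1 * rng.normal(size=(n_tokens, 3)),  # neurons 0-2 track shared features
    rng.normal(size=(n_tokens, 5)),                 # neurons 3-7 are model-specific
])
acts_b = np.hstack([
    rng.normal(size=(n_tokens, 4)),                          # neurons 0-3 are model-specific
    shared[:, ::-1] + 0.1 * rng.normal(size=(n_tokens, 3)),  # neurons 4-6, reversed order
])

corr = pairwise_neuron_correlation(acts_a, acts_b)
mask = universal_mask(acts_a, acts_b)
best_match = corr.argmax(axis=1)

print("universal neurons in A:", np.flatnonzero(mask))
print("their best matches in B:", best_match[mask])
```

In this toy setup the shared neurons sit at different indices in the two models, which is why the comparison is made over all neuron pairs rather than index by index. With more than two seeds, one plausible extension is to require a high-correlation match in every other model before calling a neuron universal. At the scale the abstract describes, the correlations would need to be accumulated in streaming batches rather than from activations held fully in memory.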