The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques that decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from Lindsey et al. (2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state) and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.