Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method for isolating neurons that encode specific skills. Building on prior work that identified "skill neurons" via soft-prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics -- such as external labels and the model's own confidence score -- thereby uncovering interpretable, task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.
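The core idea above can be sketched as follows: for each neuron, correlate its per-example activation with an auxiliary metric (here, a binary external label) and rank neurons by the strength of that correlation. This is a minimal illustrative sketch with synthetic data, not the paper's actual implementation; all names, shapes, and the correlation statistic (absolute Pearson r) are assumptions.

```python
import numpy as np

def skill_neuron_scores(activations: np.ndarray, metric: np.ndarray) -> np.ndarray:
    """Hypothetical scoring sketch: |Pearson r| between each neuron's
    activation and an auxiliary metric.

    activations: (n_examples, n_neurons) array of per-example activations.
    metric: (n_examples,) auxiliary signal (e.g. label or confidence).
    Returns one score per neuron; higher means more predictive of the metric.
    """
    a = activations - activations.mean(axis=0)   # center per neuron
    m = metric - metric.mean()                   # center the metric
    cov = a.T @ m / len(m)                       # per-neuron covariance
    denom = a.std(axis=0) * m.std()
    return np.abs(cov / np.where(denom == 0, 1.0, denom))

# Synthetic demo: plant one neuron whose activation tracks the label.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200).astype(float)
acts = rng.normal(size=(200, 8))
acts[:, 3] += 2.0 * labels                       # neuron 3 encodes the "skill"
scores = skill_neuron_scores(acts, labels)
print(scores.argmax())                           # the planted neuron ranks first
```

In practice the auxiliary metric could equally be the model's own confidence score rather than an external label; the ranking step is unchanged.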