Pervasive polysemanticity in large language models (LLMs) undermines discrete neuron-concept attribution, posing a significant challenge for model interpretation and control. We systematically analyze both encoder- and decoder-based LLMs across diverse datasets and observe that even neurons that are highly salient for a specific semantic concept consistently exhibit polysemantic behavior. Importantly, we uncover a consistent pattern: the concept-conditioned activation magnitudes of a neuron form distinct, often Gaussian-like distributions with minimal overlap. Building on this observation, we hypothesize that interpreting and intervening on concept-specific activation ranges can enable more precise interpretability and targeted manipulation in LLMs. To this end, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that localizes concept attribution to activation ranges within a neuron. Extensive empirical evaluations show that range-based interventions manipulate target concepts effectively while causing substantially less collateral degradation to auxiliary concepts and overall model performance than neuron-level masking.
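The contrast between range-based intervention and neuron-level masking can be illustrated with a minimal sketch. It is not the NeuronLens implementation: the function names, the mean ± k·std rule for choosing the range, the default k = 2.5, and the use of zero as the replacement value are all assumptions made here for illustration, motivated only by the observation that concept-conditioned activations are often Gaussian-like.

```python
import numpy as np

def concept_range(acts, k=2.5):
    """Hypothetical helper: estimate a concept's activation range for one
    neuron as mean ± k * std of its activations on that concept.
    The Gaussian-style rule and k = 2.5 are assumptions, not from the source."""
    acts = np.asarray(acts, dtype=float)
    mu, sigma = acts.mean(), acts.std()
    return mu - k * sigma, mu + k * sigma

def range_mask(acts, low, high, fill=0.0):
    """Range-based intervention: overwrite only activations that fall inside
    [low, high]; activations outside the range pass through unchanged."""
    acts = np.asarray(acts, dtype=float)
    inside = (acts >= low) & (acts <= high)
    return np.where(inside, fill, acts)

def neuron_mask(acts, fill=0.0):
    """Neuron-level masking: overwrite every activation of the neuron."""
    return np.full_like(np.asarray(acts, dtype=float), fill)

# Toy activations of a single neuron (made-up numbers, for illustration only).
target_acts = [1.8, 2.0, 2.2]        # activations on the target concept
mixed_acts = [2.0, -1.0, 1.9, 0.5]   # target-like and auxiliary activations

low, high = concept_range(target_acts)
print(range_mask(mixed_acts, low, high))  # only in-range values are replaced
print(neuron_mask(mixed_acts))            # every value is replaced
```

In this toy setting the range-based mask leaves the activations at -1.0 and 0.5 untouched, whereas neuron-level masking removes them as well, which is the kind of collateral effect on auxiliary concepts that the abstract describes.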