Negative Pre-activations Differentiate Syntax

Modern large language models increasingly use smooth activation functions such as GELU or SiLU, allowing negative pre-activations to carry both signal and gradient. Nevertheless, many neuron-level interpretability analyses have historically focused on large positive activations, often implicitly treating the negative region as less informative, a carryover from the ReLU era. We challenge this assumption and ask whether and how models leverage negative pre-activations. We address this question by studying a sparse subpopulation of Wasserstein neurons whose output distributions deviate strongly from a Gaussian baseline and that functionally differentiate similar inputs. We show that the negative region plays an active role rather than being a mere side effect of gradient-based optimization. A minimal, sign-specific intervention that zeroes only the negative pre-activations of a small set of Wasserstein neurons substantially increases perplexity and sharply degrades grammatical performance on BLiMP and TSE, whereas random and perplexity-matched ablations of the negative pre-activations of many more non-Wasserstein neurons leave grammatical performance largely intact. Conversely, on a suite of non-grammatical benchmarks, the perplexity-matched control ablation is more damaging than the Wasserstein neuron ablation, yielding a double dissociation between syntax and other capabilities. Part-of-speech analysis localizes the excess surprisal to syntactic scaffolding tokens, layer-specific interventions show that small local degradations accumulate across depth, and training-dynamics analysis reveals that the same sign-specific ablation becomes more harmful as Wasserstein neurons emerge and stabilize. Together, these results identify negative pre-activations in a sparse subpopulation of Wasserstein neurons as an actively used substrate for syntax in smooth-activation language models.
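The sign-specific intervention described above can be illustrated with a minimal sketch. It assumes pre-activations are available as a token-by-neuron grid and that the set of targeted neurons is already known. The names `ablate_negative`, `pre_acts`, and `neuron_ids` are hypothetical and not taken from the paper, and how Wasserstein neurons are selected is not shown.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def ablate_negative(pre_acts, neuron_ids):
    """Zero only the negative pre-activations of the selected neurons.

    pre_acts:   list of rows (one per token), each a list of floats
                (one per neuron); a stand-in for an MLP's pre-activations.
    neuron_ids: indices of the neurons to intervene on (assumed given).
    Positive values, and all values of non-selected neurons, are untouched.
    """
    targets = set(neuron_ids)
    return [
        [0.0 if (j in targets and z < 0.0) else z for j, z in enumerate(row)]
        for row in pre_acts
    ]


# Toy example: 2 tokens, 3 neurons; intervene on neurons 0 and 2 only.
pre_acts = [[-1.5, 0.8, -0.3], [2.0, -0.7, 0.4]]
ablated = ablate_negative(pre_acts, neuron_ids=[0, 2])

# The activation function is applied after the intervention.
outputs = [[gelu(z) for z in row] for row in ablated]
```

In this toy case, neuron 1 keeps its negative pre-activation and so still produces a small negative GELU output, whereas the negative values of neurons 0 and 2 are clamped to zero before the nonlinearity.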
