The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers

We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we find that specific neurons implement a consensus architecture: seven "default-ON" neurons and one exception handler (N2123 in Layer 11) that are 93-98% mutually exclusive, which together form a binary routing switch. A cross-layer analysis reveals a developmental arc: early layers (L1-3) use single gateway neurons to route exceptions without consensus quorums; middle layers (L4-6) show diffuse processing with neither gateway nor consensus; and late layers (L7-11) crystallize full consensus/exception architectures with increasing quorum size (1 to 3 to 7 consensus neurons). Causal validation confirms that the routing is functional: removing the MLP at consensus breakdown increases perplexity by 43.3%, while removing it at full consensus increases perplexity by only 10.1%, a more than 4x difference. Comparing binary and continuous features for the routing decision shows that binarization loses essentially no information (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22). This binary routing structure explains why smooth polynomial approximation fails: cross-validated polynomial fits (degrees 2-7) never exceed R^2 = 0.06 for highly nonlinear layers. We propose that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing, routing continuous signals through qualitatively different computational paths.
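To make the "mutually exclusive" statistic concrete, the following is a minimal sketch of one way such a score could be computed from per-token activations of two neurons. It is not the paper's code: the function names `binarize` and `mutual_exclusivity`, the ON threshold of 0.0, and the definition of exclusivity as the fraction of tokens on which exactly one neuron is ON are all assumptions made for illustration, and the activations are synthetic rather than taken from GPT-2.

```python
import numpy as np


def binarize(acts, threshold=0.0):
    """Return a boolean ON/OFF mask. The threshold of 0.0 is an assumption."""
    return np.asarray(acts) > threshold


def mutual_exclusivity(acts_a, acts_b, threshold=0.0):
    """Fraction of tokens on which exactly one of the two neurons is ON.

    This XOR-based definition is an illustrative assumption, not
    necessarily the measure used in the paper.
    """
    on_a = binarize(acts_a, threshold)
    on_b = binarize(acts_b, threshold)
    return float(np.mean(on_a ^ on_b))


# Synthetic activations for illustration only (not data from the paper):
# a "default-ON" neuron that switches off on rare exception tokens, and an
# exception handler that switches on for exactly those tokens.
rng = np.random.default_rng(0)
n_tokens = 10_000
is_exception = rng.random(n_tokens) < 0.1
noise_a = 0.1 * rng.standard_normal(n_tokens)
noise_b = 0.1 * rng.standard_normal(n_tokens)
consensus_acts = np.where(is_exception, -1.0, 1.0) + noise_a
exception_acts = np.where(is_exception, 1.0, -1.0) + noise_b

score = mutual_exclusivity(consensus_acts, exception_acts)
print(f"mutual exclusivity: {score:.3f}")
```

On real activations, the same function could be applied to each consensus neuron paired with the exception handler; a score near 1 would indicate that the pair behaves like a two-state switch.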
