We introduce directional routing, a lightweight mechanism that gives each transformer attention head learned suppression directions controlled by a shared router, at a 3.9% parameter cost. We train a 433M-parameter model alongside an identical baseline in a single run, then trace the resulting circuits through mechanistic interpretability. Routing becomes the model's dominant computational pathway. Disabling it collapses factual recall to near-zero probability across all 8 test prompts and drops induction accuracy from 93.4% to 0.0%. Knocking out individual attention heads has negligible effect: removing the primary mover head actually increases target probability, and induction heads retain 98.6% accuracy without their strongest member. The coordination mechanism is irreplaceable; the components it coordinates are not. The model also self-organizes, without explicit pressure, into two regimes: domain-adaptive routing in early layers and fixed syntactic pruning in late layers, where the least-varying layer is the most critical (+42.6 PPL when disabled). Routing reduces perplexity by 31-56% relative to the baseline, though downstream multiple-choice benchmarks do not yet reflect these gains.
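The mechanism described above can be sketched in code. This is a minimal, hypothetical illustration only: the abstract does not specify the number of directions per head, the router's input, or the gating nonlinearity, so all of those choices (one direction per head, a sigmoid-gated linear router reading the residual stream, and subtractive projection) are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DirectionalRouter(nn.Module):
    """Hypothetical sketch of directional routing: a shared router emits one
    gate per attention head, scaling how much of each head's output is
    suppressed along that head's learned direction."""

    def __init__(self, d_model: int, n_heads: int, d_head: int):
        super().__init__()
        # One learned suppression direction per head (an assumption; the
        # paper may use multiple directions per head).
        self.directions = nn.Parameter(torch.randn(n_heads, d_head) * 0.02)
        # Shared router: maps the residual stream to a gate for every head.
        self.router = nn.Linear(d_model, n_heads)

    def forward(self, head_out: torch.Tensor, resid: torch.Tensor) -> torch.Tensor:
        # head_out: (batch, seq, n_heads, d_head); resid: (batch, seq, d_model)
        gates = torch.sigmoid(self.router(resid))  # (batch, seq, n_heads)
        dirs = F.normalize(self.directions, dim=-1)  # unit directions
        # Component of each head's output along its suppression direction.
        proj = (head_out * dirs).sum(dim=-1, keepdim=True) * dirs
        # Subtract the gated component; gate ≈ 0 leaves the head untouched.
        return head_out - gates.unsqueeze(-1) * proj


if __name__ == "__main__":
    routing = DirectionalRouter(d_model=64, n_heads=4, d_head=16)
    head_out = torch.randn(2, 8, 4, 16)
    resid = torch.randn(2, 8, 64)
    out = routing(head_out, resid)
    print(out.shape)  # same shape as head_out: (2, 8, 4, 16)
```

A toy configuration like the one above adds only the direction and router parameters on top of the attention head itself, which is consistent in spirit with the small parameter overhead the abstract reports, though the exact 3.9% figure depends on details not given here.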