Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology and observing the resulting training dynamics. We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Zp) - investigating whether specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology that enforces L2 normalization throughout the residual stream and uses an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom and reduces grokking onset time by over 20x without weight decay. Second, we introduce a Uniform Attention Ablation that overrides data-dependent query-key routing with a uniform distribution, reducing the attention layer to a Continuous Bag-of-Words (CBOW) aggregator. Despite losing adaptive routing, these models achieve 100% generalization across all seeds and bypass the grokking delay entirely. To test whether this acceleration reflects task-specific geometric alignment rather than a generic optimization stabilizer, we use non-commutative S5 permutation composition as a negative control. Enforcing spherical constraints on S5 does not accelerate generalization. This suggests that eliminating the memorization phase depends strongly on aligning architectural priors with the task's intrinsic symmetries. Together, these findings provide interventional evidence that architectural degrees of freedom substantially influence grokking, suggesting a predictive structural perspective on training dynamics.
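The two interventions described above can be summarized in a minimal sketch, assuming a standard decoder-only Transformer in PyTorch. The names sphere_project, FixedTemperatureUnembed, and uniform_attention are illustrative placeholders, not the authors' implementation; the fixed temperature value and the non-causal averaging are assumptions for exposition.

```python
# Minimal sketch (not the authors' code) of the two interventions:
# (1) a spherical residual stream with a fixed-temperature unembedding,
# (2) a Uniform Attention Ablation that replaces query-key routing
#     with uniform averaging (a CBOW-style aggregator).
import torch
import torch.nn as nn
import torch.nn.functional as F


def sphere_project(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """L2-normalize the residual stream, constraining states to the unit sphere."""
    return x / x.norm(dim=-1, keepdim=True).clamp_min(eps)


class FixedTemperatureUnembed(nn.Module):
    """Unembedding with unit-norm class directions and a fixed (non-learned)
    logit scale, removing magnitude-based degrees of freedom at the output."""

    def __init__(self, d_model: int, vocab_size: int, temperature: float = 10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, d_model))
        self.temperature = temperature  # fixed temperature, not a parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.normalize(self.weight, dim=-1)  # unit-norm unembedding rows
        x = sphere_project(x)                 # unit-norm residual state
        return self.temperature * x @ w.T     # cosine logits at fixed scale


def uniform_attention(v: torch.Tensor) -> torch.Tensor:
    """Uniform Attention Ablation: ignore query-key scores and mix value
    vectors uniformly over positions.

    v: (batch, seq, d_model) value projections. A causal variant would
    average only over preceding positions instead.
    """
    seq_len = v.shape[1]
    weights = torch.full((seq_len, seq_len), 1.0 / seq_len,
                         device=v.device, dtype=v.dtype)
    return weights @ v  # every position receives the same uniform mixture
```

In this sketch, sphere_project would be applied after each sublayer so that the residual stream never leaves the sphere, and uniform_attention replaces the softmax(QK^T) weighting while the value and output projections are kept unchanged.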