We investigate grokking in transformers through the lens of inductive bias: dispositions, arising from architecture or optimization, that lead the network to prefer one solution over another. We first show that architectural choices, such as the position of Layer Normalization (LN), strongly modulate grokking speed. We explain this modulation by isolating how LN on specific pathways shapes shortcut learning and attention entropy. We then study how different optimization settings modulate grokking, yielding distinct interpretations of previously proposed controls such as readout scale. In particular, we find that in our setting, using readout scale as a control for lazy training can be confounded by learning rate and weight decay. Accordingly, we show that features evolve continuously throughout training, suggesting that grokking in transformers can be more nuanced than a lazy-to-rich transition of the learning regime. Finally, we show that generalization emerges predictably alongside feature compressibility in grokking, across different modulators of inductive bias. Our code is released at https://tinyurl.com/y52u3cad.