In-context learning enables transformers to adapt to new tasks from a few examples at inference time, while grokking shows that this generalization can emerge abruptly, and only after prolonged training. We study task generalization and grokking in in-context learning from a Bayesian perspective, asking what enables the delayed transition from memorization to generalization. Concretely, we consider modular arithmetic tasks in which a transformer must infer a latent linear function solely from in-context examples, and we analyze how predictive uncertainty evolves during training. We combine approximate Bayesian techniques to estimate the posterior distribution and study how uncertainty behaves across training and under changes in task diversity, context length, and context noise. We find that epistemic uncertainty collapses sharply when the model groks, making uncertainty a practical label-free diagnostic of generalization in transformers. Additionally, we provide theoretical support with a simplified Bayesian linear model, showing that, asymptotically, both delayed generalization and uncertainty peaks arise from the same underlying spectral mechanism, which links grokking time to uncertainty dynamics.
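To make the uncertainty diagnostic concrete, the sketch below shows one common way to separate epistemic from aleatoric uncertainty when several approximate posterior samples of a classifier are available. It is an illustration under stated assumptions, not the method of the abstract: the mutual-information decomposition (total predictive entropy minus expected per-sample entropy) is assumed here because the abstract does not specify the estimator, and the task form z = (a*x + b*y) mod p, along with the names `sample_task` and `decompose_uncertainty`, are hypothetical.

```python
import numpy as np


def sample_task(p, n_examples, rng):
    """Sample one in-context task (assumed form: z = (a*x + b*y) mod p).

    Returns the inputs of shape (n_examples, 2), the labels of shape
    (n_examples,), and the latent coefficients (a, b).
    """
    a, b = rng.integers(0, p, size=2)
    xy = rng.integers(0, p, size=(n_examples, 2))
    z = (a * xy[:, 0] + b * xy[:, 1]) % p
    return xy, z, (int(a), int(b))


def entropy(probs, eps=1e-12):
    """Shannon entropy in nats over the last axis."""
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)


def decompose_uncertainty(probs):
    """Split predictive uncertainty given posterior-sample predictions.

    probs has shape (S, N, C): S posterior samples (e.g. ensemble members
    or stochastic forward passes), N queries, C classes.
    Returns (total, aleatoric, epistemic), each of shape (N,).
    """
    mean_probs = probs.mean(axis=0)          # posterior predictive, (N, C)
    total = entropy(mean_probs)              # entropy of the averaged prediction
    aleatoric = entropy(probs).mean(axis=0)  # average entropy of each sample
    epistemic = total - aleatoric            # mutual information (disagreement)
    return total, aleatoric, epistemic
```

Under this decomposition, posterior samples that agree with one another give an epistemic term near zero even when each prediction is individually uncertain, whereas confident but conflicting samples give a large one. Because the computation needs only predicted probabilities and no ground-truth labels, a quantity of this kind can be tracked across training checkpoints as a label-free signal.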