This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions with varying statistical significance. We demonstrate that transformers learn this task incrementally: each stage is defined by the acquisition of specific information through sparse attention patterns. Notably, we identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns. We model these dynamics using simplified differential equations that characterize the trajectory and prove stage-wise convergence results. Our analysis reveals that transformers ascend a complexity ladder by passing through simpler, misspecified hypothesis classes before reaching the full model class. We further show that early stopping acts as an implicit regularizer, biasing the model toward these simpler classes. These results provide a theoretical foundation for the emergence of staged learning and complex behaviors in transformers, offering insights into generalization for natural language processing and algorithmic reasoning.
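As a minimal illustration of the task family the abstract describes, the sketch below samples sequences from a random order-k Markov chain, where the next token depends on the previous k positions. The function name, vocabulary size, and transition construction are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def sample_order_k_chain(k=3, vocab=2, length=20, seed=0):
    """Sample a sequence from a random order-k Markov chain.

    The next-token distribution depends on the previous k tokens,
    so a learner must integrate information from k past positions.
    """
    rng = np.random.default_rng(seed)
    # One categorical next-token distribution per k-gram context:
    # shape (vocab**k, vocab).
    probs = rng.dirichlet(np.ones(vocab), size=vocab ** k)
    seq = list(rng.integers(0, vocab, size=k))  # random initial context
    for _ in range(length - k):
        # Encode the last k tokens as a single context index.
        ctx = 0
        for t in seq[-k:]:
            ctx = ctx * vocab + int(t)
        seq.append(int(rng.choice(vocab, p=probs[ctx])))
    return [int(t) for t in seq]

print(sample_order_k_chain())
```

Training a transformer on such sequences requires it to attend to all k relevant positions, which is what makes the competitive-to-cooperative attention dynamics observable.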