Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive gradient descent to discover such systematic reasoning remains poorly understood. We address this by analyzing the gradient flow dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought (CoT) but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, gradient flow drives the model to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler instances, the model learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, gradient-based learning becomes infeasible. We corroborate our theory with experiments on synthetic data and with real-world language models on mathematical reasoning tasks, confirming that our findings carry over to practical settings.