Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate these results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, confirming that our theoretical findings carry over to practical settings.
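As a minimal sketch of the kind of task described above (not the paper's exact construction; the instance format and helper names here are illustrative assumptions), one can generate a hidden chain of vertices whose edges are presented in shuffled order, so that recovering the final vertex requires following the chain hop-by-hop — exactly the vertex-by-vertex Chain-of-Thought the abstract refers to:

```python
import random

def make_instance(num_vertices, path_len, rng=random):
    """Build a toy graph-traversal instance: a hidden chain of vertices.

    The prompt would list the edges in shuffled order plus the start
    vertex; answering requires following the chain hop-by-hop, which
    cannot be collapsed into a single lookup.
    (Illustrative format, not the paper's exact setup.)
    """
    vertices = rng.sample(range(num_vertices), path_len + 1)
    edges = list(zip(vertices, vertices[1:]))
    rng.shuffle(edges)  # edges are given out of order
    return {"edges": edges, "start": vertices[0], "answer": vertices[-1]}

def traverse(instance):
    """The simple iterative solution: follow edges vertex-by-vertex.

    Returns the Chain-of-Thought (the visited vertices) and the answer.
    """
    succ = dict(instance["edges"])  # successor map of the chain
    chain = [instance["start"]]
    while chain[-1] in succ:
        chain.append(succ[chain[-1]])
    return chain, chain[-1]

rng = random.Random(0)
inst = make_instance(num_vertices=20, path_len=5, rng=rng)
chain, answer = traverse(inst)
assert answer == inst["answer"]
assert len(chain) == 6  # start vertex plus path_len hops
```

Under this reading, the "simple examples" condition corresponds to the training distribution placing sufficient mass on small values of `path_len`, from which the iterative strategy extrapolates to longer chains.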