The reasoning abilities of large language models (LLMs) have improved with chain-of-thought (CoT) prompting, which allows models to solve complex tasks in a stepwise manner. However, training CoT capabilities requires detailed reasoning data, which is often scarce. The Self-Taught Reasoner (STaR) framework addresses this by using reinforcement learning to generate reasoning steps automatically, reducing reliance on human-labeled data. Although STaR and its variants have demonstrated empirical success, a theoretical foundation explaining these improvements has been lacking. This work provides a theoretical framework for understanding the effectiveness of reinforcement learning on CoT reasoning, with STaR as the central case. Our contributions are: (1) an analysis of policy improvement, showing why LLM reasoning improves iteratively under STaR; (2) conditions for convergence to an optimal reasoning policy; (3) an examination of STaR's robustness, explaining how it can improve reasoning even when occasional incorrect steps are incorporated; and (4) criteria for the quality of pre-trained models necessary to initiate effective reasoning improvement. This framework aims to bridge empirical findings with theoretical insights, advancing reinforcement learning approaches to reasoning in LLMs.
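To make the mechanism under analysis concrete, the following is a minimal sketch of the STaR-style loop the abstract refers to: sample CoT rationales, keep only those whose final answer matches the ground truth, and fine-tune on the kept traces. The helpers `model.generate_cot` and `finetune` are hypothetical placeholders (not a real API), and the sketch omits STaR's rationalization pass, in which hinted answers are used to produce rationales for failed problems.

```python
def star_iteration(model, dataset, num_rounds=5, samples_per_question=4):
    """Iteratively fine-tune `model` on its own correct rationales.

    `model.generate_cot(question)` and `finetune(model, examples)` are
    hypothetical helpers assumed for this sketch, not library calls.
    """
    for _ in range(num_rounds):
        accepted = []
        for question, gold_answer in dataset:
            for _ in range(samples_per_question):
                rationale, answer = model.generate_cot(question)
                # Binary outcome reward: keep the trace only if the
                # final answer is correct.
                if answer == gold_answer:
                    accepted.append((question, rationale, answer))
                    break
        # Policy improvement step: supervised fine-tuning on the
        # reward-1 traces (the original STaR recipe restarts this
        # fine-tuning from the base model each round).
        model = finetune(model, accepted)
    return model
```

Viewed this way, each round is a policy improvement step under a binary correctness reward, which is the lens the abstract's analysis of iterative improvement, convergence, and robustness takes.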