We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. To enhance consistency in intermediate steps, we combine outcome validation and stepwise self-evaluation, continually updating the quality assessment of newly generated data. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data. Theoretical analysis reveals the critical importance of using on-policy sampled data for successful self-improvement. Extensive evaluations on various arithmetic and commonsense reasoning tasks demonstrate remarkable performance improvements over existing models. For instance, our approach outperforms the Mistral-7B Supervised Fine-Tuning (SFT) baseline on GSM8K, MATH, and SciQ, raising accuracy to $80.7\%$ (+$4.8\%$), $32.2\%$ (+$3.3\%$), and $88.5\%$ (+$7.7\%$), respectively. Additionally, we examine the tradeoff between training and inference compute, providing insights into how our method effectively maximizes performance gains.
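As a rough illustration of the step-level preference idea described above, the following minimal sketch picks a preference pair among sibling reasoning steps that share a prefix and evaluates the standard DPO loss on that pair. The `Candidate` class, its fields, the rule of pairing the highest- and lowest-valued siblings, and the default `beta` are all assumptions made for illustration; the abstract does not specify how pairs are selected or how the loss is weighted.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate next step expanded from a shared reasoning prefix (hypothetical)."""
    text: str
    q_value: float      # value estimate produced by the tree search (assumed field)
    policy_logp: float  # log-probability of the step under the current policy
    ref_logp: float     # log-probability of the step under the frozen reference policy


def pick_preference_pair(candidates):
    """Return (preferred, dispreferred): the highest- and lowest-valued siblings.

    This selection rule is an assumption, not taken from the source.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two sibling candidates")
    ranked = sorted(candidates, key=lambda c: c.q_value)
    return ranked[-1], ranked[0]


def dpo_loss(chosen, rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(beta * margin)), for a single pair."""
    margin = (chosen.policy_logp - chosen.ref_logp) - (
        rejected.policy_logp - rejected.ref_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    siblings = [
        Candidate("step A", q_value=0.2, policy_logp=-1.2, ref_logp=-1.0),
        Candidate("step B", q_value=0.9, policy_logp=-0.8, ref_logp=-1.1),
        Candidate("step C", q_value=0.5, policy_logp=-1.0, ref_logp=-1.0),
    ]
    chosen, rejected = pick_preference_pair(siblings)
    print(chosen.text, rejected.text, dpo_loss(chosen, rejected))
```

In an iterative setup, the log-probabilities would be recomputed under the updated policy each round, so that the pairs stay on-policy; the sketch leaves out the search itself and the model update.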