Although recent advancements in large language models (LLMs) have significantly improved their performance on various tasks, they still face challenges with complex, symbolic, multi-step reasoning, particularly mathematical reasoning. To bolster the mathematical reasoning capabilities of LLMs, most existing efforts rely on assistance from either domain experts or GPT-4 to obtain high-quality process-supervised data, which is both expensive and labor-intensive. In our study, we propose an innovative framework, AlphaMath, that bypasses the need for process annotations (from humans or GPTs) by leveraging Monte Carlo Tree Search (MCTS). This framework focuses on unleashing the potential of a well-pretrained LLM to autonomously enhance its own mathematical reasoning. Specifically, we integrate a value model with the LLM, automatically generating both process-supervision and step-level evaluation signals within MCTS. Furthermore, we propose an efficient inference strategy, step-level beam search, in which the value model assists the policy model (i.e., the LLM) in navigating toward more effective reasoning paths, rather than relying solely on prior probabilities. Experimental results on both in-domain and out-of-domain datasets demonstrate that, even without GPT-4 or human-annotated process supervision, our AlphaMath framework achieves results comparable or superior to previous state-of-the-art methods.
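The step-level beam search described above can be sketched in outline: at each reasoning step, the policy model proposes several candidate next steps, and the value model scores the resulting partial paths so that only the highest-value paths survive. The sketch below is a minimal, hypothetical illustration, not the paper's implementation; `propose_steps` and `value_of` are placeholder stand-ins for calls to the policy LLM and the trained value model.

```python
import heapq

def propose_steps(partial_path, num_candidates):
    """Placeholder for the policy model (LLM): propose candidate next
    reasoning steps given the partial solution so far."""
    return [partial_path + [f"step({len(partial_path)},{i})"]
            for i in range(num_candidates)]

def value_of(partial_path):
    """Placeholder for the value model: score a partial reasoning path
    (higher is better). A real value model would be learned via MCTS."""
    return -len(partial_path[-1])  # dummy heuristic for illustration

def step_level_beam_search(question, beam_width=2, num_candidates=3, max_steps=4):
    """Keep the beam_width highest-value partial paths at every step,
    expanding each with num_candidates policy proposals. Ranking uses the
    value model rather than policy prior probabilities alone."""
    beam = [[question]]
    for _ in range(max_steps):
        candidates = []
        for path in beam:
            candidates.extend(propose_steps(path, num_candidates))
        beam = heapq.nlargest(beam_width, candidates, key=value_of)
    return beam[0]  # best full path: question followed by max_steps steps
```

The key design point, as the abstract notes, is that path selection is driven by the value model's step-level evaluations instead of the policy's prior probabilities alone, which prunes unpromising reasoning branches early.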