This paper presents LLaMA-Berry, an advanced mathematical problem-solving framework for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path, and uses a pairwise reward model to evaluate candidate paths globally. By leveraging the self-critique and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms, enabling more efficient exploration of the solution space. A Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), then models pairwise preferences between solutions, and an Enhanced Borda Count (EBC) method synthesizes these preferences into a global ranking score to identify better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior search efficiency and problem-solving capability compared to existing methods such as ToT and rStar, particularly on complex Olympiad-level benchmarks, including GPQA, AIME24, and AMC23.
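To make the preference-aggregation step concrete, the following is a minimal Python sketch of how pairwise preferences can be synthesized into a global ranking with a plain Borda-count-style method. The `prefer` callable and `borda_rank` helper are hypothetical stand-ins: `prefer(a, b)` plays the role of the PPRM, returning the estimated probability that solution `a` is better than solution `b`. This sketch shows only the basic Borda aggregation; the paper's Enhanced Borda Count builds on this idea with additional refinements not reproduced here.

```python
import itertools

def borda_rank(solutions, prefer):
    """Rank solutions by Borda score: the number of pairwise
    comparisons each solution wins against all the others.

    `prefer(a, b)` is assumed to return the probability that
    solution `a` is preferred over solution `b` (a PPRM stand-in).
    """
    scores = {s: 0 for s in solutions}
    for a, b in itertools.combinations(solutions, 2):
        if prefer(a, b) > 0.5:  # a wins this pairwise comparison
            scores[a] += 1
        else:
            scores[b] += 1
    # Higher Borda score = globally preferred solution.
    return sorted(solutions, key=lambda s: scores[s], reverse=True)

if __name__ == "__main__":
    # Toy preference function that favors shorter candidate solutions,
    # used only to demonstrate the aggregation mechanics.
    candidates = ["solution_a", "solution_bb", "solution_ccc"]
    ranking = borda_rank(candidates,
                         lambda a, b: 1.0 if len(a) < len(b) else 0.0)
    print(ranking)  # ['solution_a', 'solution_bb', 'solution_ccc']
```

A pairwise formulation like this sidesteps the need for a calibrated absolute score per solution, which is one motivation the abstract cites for preferring preference modeling over direct scoring in mathematical reasoning tasks.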