Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work has proposed advanced prompting techniques and fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet the efficacy of LLMs in self-refining their responses, particularly on complex reasoning and planning tasks, remains dubious. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLMs for self-improvement, including data scarcity, the vast search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models that provide precise feedback. Our experimental results on mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.
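The loop described above pairs an LLM policy that proposes reasoning steps with critic models that score them, searched via MCTS. As a toy illustration only — the `policy` and `critic` functions below are hypothetical stand-ins, not AlphaLLM's actual components — a minimal UCT-style search over discrete "steps" might look like:

```python
import math

def policy(state):
    """Propose candidate next steps (stub for an LLM; real steps are text)."""
    return [state + (a,) for a in (0, 1)]

def critic(state):
    """Score a partial solution in [0, 1) (stub for the critic models)."""
    return sum(state) / (len(state) + 1)

def mcts_search(root=(), iterations=200, c=1.4, depth=4):
    """Minimal UCT search: select, expand, evaluate with the critic, back up."""
    N, Q, children = {root: 0}, {root: 0.0}, {}
    for _ in range(iterations):
        path, state = [root], root
        # Selection: descend by the UCT score until an unexpanded node.
        while state in children and len(state) < depth:
            parent = state
            state = max(
                children[parent],
                key=lambda s: Q[s] / (N[s] + 1e-9)
                + c * math.sqrt(math.log(N[parent] + 1) / (N[s] + 1e-9)),
            )
            path.append(state)
        # Expansion: register the policy's proposals as children.
        if len(state) < depth and state not in children:
            children[state] = policy(state)
            for child in children[state]:
                N.setdefault(child, 0)
                Q.setdefault(child, 0.0)
        # Evaluation: the critic's score replaces a random rollout.
        value = critic(state)
        # Backup: propagate the value along the visited path.
        for s in path:
            N[s] += 1
            Q[s] += value
    # The most-visited first step is returned as the "improved" response;
    # in a full self-improving loop it would become fine-tuning data.
    return max(children[root], key=lambda s: N[s])

best = mcts_search()
```

Here the critic rewards steps equal to 1, so the search concentrates visits on that branch; the closing step of the real loop, fine-tuning the policy on the search's preferred trajectories, is omitted for brevity.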