Recently, test-time scaling has garnered significant attention from the research community, largely due to the substantial advancements of the o1 model released by OpenAI. By allocating more computational resources during the inference phase, large language models~(LLMs) can extensively explore the solution space by generating more thought tokens or diverse solutions, thereby producing more accurate responses. However, developing an o1-like reasoning approach is challenging, and researchers have been making various attempts to advance this open area of research. In this paper, we present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms. The framework integrates three key components: a policy model, a reward model, and a search algorithm. It is primarily constructed around a tree search algorithm, in which the policy model navigates a dynamically expanding tree guided by a specially trained reward model. We thoroughly explore the design considerations necessary for implementing this framework and provide a detailed report of the technical aspects. To assess the effectiveness of our approach, we focus on mathematical reasoning tasks and conduct extensive evaluations on four challenging datasets; the results show that our framework significantly enhances the reasoning abilities of LLMs.
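To make the interplay of the three components concrete, the following is a minimal sketch of reward-guided best-first tree search. The function names (`policy_expand`, `reward_score`) and the best-first strategy are illustrative assumptions, not the paper's exact algorithm; in the actual framework, expansion would call the policy model to generate candidate reasoning steps, and scoring would query the trained reward model.

```python
import heapq

def reward_guided_tree_search(policy_expand, reward_score, root,
                              max_expansions=50, beam_width=4):
    """Best-first search over a dynamically expanding tree.

    policy_expand(node) -> list of child nodes (candidate next steps;
                           stands in for sampling the policy model)
    reward_score(node)  -> float, higher is better (stands in for the
                           trained reward model)
    """
    # Max-heap via negated scores; the counter breaks ties deterministically.
    frontier = [(-reward_score(root), 0, root)]
    counter = 1
    best, best_score = root, reward_score(root)
    for _ in range(max_expansions):
        if not frontier:
            break  # tree fully explored
        _, _, node = heapq.heappop(frontier)  # most promising node
        for child in policy_expand(node)[:beam_width]:
            s = reward_score(child)
            if s > best_score:
                best, best_score = child, s
            heapq.heappush(frontier, (-s, counter, child))
            counter += 1
    return best
```

A toy instantiation: treat partial solutions as strings and reward agreement with a target answer; the search greedily expands whichever partial solution the reward function currently prefers.

```python
def expand(n):
    # Hypothetical policy: extend the string with one of three tokens.
    return [n + c for c in "012"] if len(n) < 3 else []

def score(n):
    # Hypothetical reward: count of positions matching the target "212".
    return sum(1 for a, b in zip(n, "212") if a == b)

best = reward_guided_tree_search(expand, score, "", max_expansions=20)
```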