Recently, test-time scaling has garnered significant attention from the research community, largely due to the substantial advancements of the o1 model released by OpenAI. By allocating more computational resources during the inference phase, large language models~(LLMs) can extensively explore the solution space by generating more thought tokens or diverse solutions, thereby producing more accurate responses. However, developing an o1-like reasoning approach is challenging, and researchers have been making various attempts to advance this open area of research. In this paper, we present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms. The framework integrates three key components: a policy model, a reward model, and a search algorithm. It is primarily constructed around a tree search algorithm, in which the policy model navigates a dynamically expanding tree guided by a specially trained reward model. We thoroughly explore the design considerations necessary for implementing this framework and provide a detailed report of the technical aspects. To assess the effectiveness of our approach, we focus on mathematical reasoning tasks and conduct extensive evaluations on four challenging datasets; the results show that our framework significantly enhances the reasoning abilities of LLMs.
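To make the interplay of the three components concrete, the following is a minimal sketch of reward-guided best-first tree search. The function names (`policy_expand`, `reward_score`) and the best-first strategy are illustrative assumptions, not the paper's exact algorithm; in the actual framework, expansion would call the policy model to generate candidate reasoning steps, and scoring would query the trained reward model.

```python
import heapq

def reward_guided_tree_search(policy_expand, reward_score, root,
                              max_expansions=50, beam_width=4):
    """Best-first search over a dynamically expanding tree.

    policy_expand(node) -> list of child nodes (candidate next steps;
                           stands in for sampling the policy model)
    reward_score(node)  -> float, higher is better (stands in for the
                           trained reward model)
    """
    # Max-heap via negated scores; the counter breaks ties deterministically.
    frontier = [(-reward_score(root), 0, root)]
    counter = 1
    best, best_score = root, reward_score(root)
    for _ in range(max_expansions):
        if not frontier:
            break  # tree fully explored
        _, _, node = heapq.heappop(frontier)  # most promising node
        for child in policy_expand(node)[:beam_width]:
            s = reward_score(child)
            if s > best_score:
                best, best_score = child, s
            heapq.heappush(frontier, (-s, counter, child))
            counter += 1
    return best
```

A toy instantiation: treat partial solutions as strings and reward agreement with a target answer; the search greedily expands whichever partial solution the reward function currently prefers.

```python
def expand(n):
    # Hypothetical policy: extend the string with one of three tokens.
    return [n + c for c in "012"] if len(n) < 3 else []

def score(n):
    # Hypothetical reward: count of positions matching the target "212".
    return sum(1 for a, b in zip(n, "212") if a == b)

best = reward_guided_tree_search(expand, score, "", max_expansions=20)
```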