Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance on reasoning tasks by using long reasoning chains. However, this has also led to a significant increase in computational cost and to verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO and DAPO. In this paper, we propose BFS-PO, an RL algorithm that alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO searches for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Across different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.
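The abstract does not specify how entropy is measured or what counts as a node, so the following is only a minimal sketch under stated assumptions: a node is taken to be a single generation step, its entropy is the Shannon entropy of the next-token distribution at that step, and the backtracking point is the step with the highest entropy. The function names `token_entropy`, `max_entropy_step`, and `shortest_correct` are hypothetical and are not taken from BFS-PO itself.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_step(step_distributions):
    """Index of the generation step with the most uncertain next-token distribution.

    Assumption: this step is treated as the candidate backtracking point.
    """
    entropies = [token_entropy(dist) for dist in step_distributions]
    return max(range(len(entropies)), key=entropies.__getitem__)


def shortest_correct(responses):
    """Shortest response marked correct, or None if no response is correct.

    `responses` is a list of (tokens, is_correct) pairs.
    """
    correct = [tokens for tokens, is_correct in responses if is_correct]
    return min(correct, key=len) if correct else None
```

In a search loop, one might resample the continuation from the step returned by `max_entropy_step` and keep the result of `shortest_correct` as the current best candidate; how BFS-PO actually orders and expands nodes is not described in the abstract.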