In the domain of large language models, reinforcement fine-tuning algorithms must generate a complete reasoning trajectory starting from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze how different segments of the reasoning path affect the correctness of the final result and, based on these insights, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, RPO trains the model by generating only the suffix of the reasoning path, reusing prefixes from an experience cache. This reduces token generation during the rollout phase by approximately 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning algorithms, RPO reduces the training time of a 1.5B model by 90% and of a 7B model by 72%. Moreover, it can be integrated with typical algorithms such as GRPO and DAPO, enabling them to achieve training acceleration while maintaining performance comparable to the original algorithms. Our code is open-sourced at https://github.com/yhz5613813/RPO.
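The core idea above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration of suffix-only rollouts with an experience cache; all names (`ExperienceCache`, `partial_rollout`, `fake_generate`) and the 95% prefix ratio are illustrative assumptions, not the authors' actual implementation, and a dummy generator stands in for the LLM.

```python
import random

def fake_generate(prompt, max_new_tokens):
    """Stand-in for an LLM decode step; returns dummy tokens.
    (Illustrative only -- a real rollout would call the policy model.)"""
    return [f"tok{i}" for i in range(max_new_tokens)]

class ExperienceCache:
    """Maps a query to reasoning trajectories saved from earlier rollouts."""
    def __init__(self):
        self._store = {}

    def put(self, query, trajectory):
        self._store.setdefault(query, []).append(trajectory)

    def sample_prefix(self, query, keep_ratio=0.95):
        """Return a prefix covering `keep_ratio` of a cached trajectory,
        or None if the query has never been rolled out."""
        trajs = self._store.get(query)
        if not trajs:
            return None
        traj = random.choice(trajs)
        cut = int(len(traj) * keep_ratio)
        return traj[:cut]

def partial_rollout(query, cache, full_len=1000):
    """Generate only the suffix when a cached prefix exists; otherwise
    do one full rollout and cache it for later reuse."""
    prefix = cache.sample_prefix(query)
    if prefix is None:
        traj = fake_generate(query, full_len)  # cold start: full path
        cache.put(query, traj)
        return traj, full_len
    suffix_len = full_len - len(prefix)
    suffix = fake_generate(query + "".join(prefix), suffix_len)
    return prefix + suffix, suffix_len  # only suffix_len tokens were generated

cache = ExperienceCache()
_, first_cost = partial_rollout("q1", cache)   # full rollout
_, second_cost = partial_rollout("q1", cache)  # suffix-only rollout
print(first_cost, second_cost)  # 1000 50
```

With a 0.95 prefix ratio, the second rollout decodes only 5% of the tokens of the first, which is the source of the rollout-phase savings claimed above; the real algorithm additionally handles how cached trajectories are refreshed and scored.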