Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning, and (2) it treats all tokens uniformly, making it ineffective at credit assignment in multi-step reasoning tasks, which often come with sparse rewards. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from prior work on maximum entropy reinforcement learning, it jointly learns a policy model and a value function by optimizing the soft Bellman equation. We show in principle that this reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide tree search at no extra cost, further boosting performance at test time.
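For reference, the soft Bellman equation from maximum entropy RL mentioned above takes the following standard form, where $\alpha$ is the entropy temperature. This is a sketch of the general framework, not necessarily the exact objective OREO optimizes:

```latex
% Soft Bellman backup from maximum entropy RL (standard form).
% Q_soft satisfies a Bellman backup against the soft value V_soft,
% which replaces the hard max over actions with a log-sum-exp.
Q_{\mathrm{soft}}(s_t, a_t) = r(s_t, a_t)
  + \gamma \, \mathbb{E}_{s_{t+1}}\!\left[ V_{\mathrm{soft}}(s_{t+1}) \right]

V_{\mathrm{soft}}(s_t) = \alpha \log \sum_{a}
  \exp\!\left( Q_{\mathrm{soft}}(s_t, a) / \alpha \right)
```

Jointly fitting a policy and a value function to satisfy this relation on offline trajectories yields per-token value estimates, which is what enables the finer-grained credit assignment under sparse rewards described in the abstract.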