Improving data utilization efficiency is critical for scaling reinforcement learning (RL) to long-horizon tasks, where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update on each batch of data only once, discard it, and then collect fresh samples, which results in poor sample efficiency. In this work, we explore an alternative, value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster than GRPO but also outperforms it in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and outperforms GRPO by 2.7% on AIME24 and by 4.5% on the out-of-domain benchmark GPQA. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.
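As a rough illustration of the two ingredients named above (replay-buffer reuse and Bellman-style targets that mix stepwise and trajectory-level signals), the following Python sketch shows generic one-step targets computed over stored trajectories. It is not the ReVal algorithm: the abstract does not specify how the signals are defined or combined, so `Trajectory`, `ReplayBuffer`, `td_targets`, the additive terminal reward, and the discount `gamma` are all hypothetical names and assumptions chosen for illustration.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical per-step signals (assumption: one scalar per step).
    step_rewards: List[float]
    # Hypothetical outcome-verification score, e.g. 1.0 if the answer is correct.
    outcome_reward: float


class ReplayBuffer:
    """Fixed-capacity store; the oldest trajectories are evicted first."""

    def __init__(self, capacity: int) -> None:
        self._items = deque(maxlen=capacity)

    def add(self, traj: Trajectory) -> None:
        self._items.append(traj)

    def sample(self, batch_size: int, rng: random.Random) -> List[Trajectory]:
        # Sample without replacement; past trajectories can be drawn again
        # in later calls, which is what allows off-policy reuse.
        k = min(batch_size, len(self._items))
        return rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


def td_targets(traj: Trajectory, next_values: List[float], gamma: float = 1.0) -> List[float]:
    """Generic one-step Bellman targets r_t + gamma * V(s_{t+1}).

    Assumption: the outcome reward is simply added at the final step, and the
    final step does not bootstrap (the last entry of next_values is ignored).
    """
    n = len(traj.step_rewards)
    if len(next_values) != n:
        raise ValueError("next_values must have one entry per step")
    targets = []
    for t in range(n):
        is_last = t == n - 1
        bootstrap = 0.0 if is_last else gamma * next_values[t]
        terminal = traj.outcome_reward if is_last else 0.0
        targets.append(traj.step_rewards[t] + bootstrap + terminal)
    return targets
```

In an on-policy loop each trajectory would contribute to one update and then be dropped; here the same stored trajectory can be sampled in many later batches, with its targets recomputed from the current value estimates.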