In this paper, we propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that results in positive rewards and to provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R$^3$ overcomes these limitations by learning from correct demonstrations. Specifically, R$^3$ progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at all stages. Thus, R$^3$ establishes a step-wise curriculum, allowing outcome supervision to offer step-level signals and precisely pinpoint errors. Using Llama2-7B, our method surpasses the RL baseline on eight reasoning tasks by $4.1$ points on average. Notably, in program-based reasoning on GSM8K, it exceeds the baseline by $4.2$ points across three backbone models, and without any extra data, Codellama-7B + R$^3$ performs comparably to larger or closed-source models.
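The sliding start state described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names (`reverse_curriculum`, `outcome_reward`, `collect_rollouts`), the exact-match reward, and the toy policy interface are all assumptions made for illustration, and the RL update itself is omitted. The sketch only shows how demonstration prefixes might be ordered from the easiest stage (start near the end) to the hardest (start from the beginning), with a single outcome reward per rollout.

```python
from typing import Callable, List, Tuple


def reverse_curriculum(demo_steps: List[str]) -> List[List[str]]:
    """Return demonstration prefixes ordered from easiest to hardest stage.

    The first stage gives the model all but the last step of the
    demonstration; the last stage gives it no steps at all.
    """
    n = len(demo_steps)
    return [demo_steps[:k] for k in range(n - 1, -1, -1)]


def outcome_reward(predicted: str, gold: str) -> float:
    """Hypothetical outcome-only reward: 1.0 for an exact final-answer match."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def collect_rollouts(
    question: str,
    demo_steps: List[str],
    gold: str,
    policy: Callable[[str, List[str]], str],
) -> List[Tuple[int, float]]:
    """Run the (hypothetical) policy from each start state.

    Returns (prefix_length, reward) pairs, one per curriculum stage.
    """
    results = []
    for prefix in reverse_curriculum(demo_steps):
        answer = policy(question, prefix)  # policy completes the reasoning
        results.append((len(prefix), outcome_reward(answer, gold)))
    return results
```

Under these assumptions, the stage at which the reward first drops indicates roughly which step the policy cannot yet produce on its own, which is how a reward given only on the final result can still carry step-level information.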