Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning

Zhiheng Xi, Wenxiang Chen, Boyang Hong, Senjie Jin, Rui Zheng, Wei He, Yiwen Ding, Shichun Liu, Xin Guo, Junzhe Wang, Honglin Guo, Wei Shen, Xiaoran Fan, Yuhao Zhou, Shihan Dou, Xiao Wang, Xinbo Zhang, Peng Sun, Tao Gui, Qi Zhang, Xuanjing Huang

Source: arXiv preprint. Code released: https://github.com/WooooDyy/LLM-Reverse-Curriculum-RL

In this paper, we propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that results in positive rewards and to provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R$^3$ overcomes these limitations by learning from correct demonstrations. Specifically, R$^3$ progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at all stages. Thus, R$^3$ establishes a step-wise curriculum, allowing outcome supervision to offer step-level signals and precisely pinpoint errors. Using Llama2-7B, our method surpasses the RL baseline on eight reasoning tasks by $4.1$ points on average. Notably, in program-based reasoning on GSM8K, it exceeds the baseline by $4.2$ points across three backbone models, and without any extra data, Codellama-7B + R$^3$ performs comparably to larger or closed-source models.
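The sliding start state described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (the released repository is the reference for that). It assumes a demonstration is already split into a list of steps, and the names `reverse_curriculum` and `outcome_reward`, the newline-joined prompt format, and the exact-match reward are all hypothetical choices made for illustration. Stage k hands the model all but the last k demonstration steps as a prefix, so early stages start near the answer and later stages start closer to the bare question.

```python
from typing import List, Tuple


def reverse_curriculum(question: str, demo_steps: List[str]) -> List[Tuple[str, List[str]]]:
    """Build (prompt, remaining_steps) pairs ordered from easiest to hardest.

    Hypothetical helper: stage k keeps the first n - k demonstration steps
    as a given prefix, so the model must generate only the last k steps.
    """
    n = len(demo_steps)
    stages = []
    for k in range(1, n + 1):
        prefix = demo_steps[: n - k]           # steps the model is given
        remaining = demo_steps[n - k:]         # steps the model must produce
        prompt = "\n".join([question, *prefix])
        stages.append((prompt, remaining))
    return stages


def outcome_reward(predicted_answer: str, gold_answer: str) -> float:
    """Outcome-only reward (assumed exact match): 1.0 if the final answer is right."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0


# Toy demonstration with three reasoning steps (made-up example data).
question = "Q: 2 + 3 * 4 = ?"
demo = ["3 * 4 = 12", "2 + 12 = 14", "Answer: 14"]
stages = reverse_curriculum(question, demo)

for prompt, remaining in stages:
    print(f"model must generate {len(remaining)} step(s) from prompt: {prompt!r}")
```

Because each stage differs from the previous one by a single additional step the model has to produce, a drop in the outcome reward between adjacent stages points at that step, which is the sense in which outcome supervision yields a step-level signal here. How the stages are scheduled or mixed during RL training is not specified in the abstract and is left out of this sketch.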

