Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that, when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B parameters), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code is available at https://github.com/divelab/E2H-Reasoning.
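
To make the easy-to-hard idea concrete, the following is a minimal sketch of one possible schedule in which sampling weight moves from the easiest difficulty level to the hardest as training proceeds, so that easy tasks fade out entirely. The abstract does not specify the scheduling function, so the triangular weighting and the names `task_probabilities`, `sample_level`, `step`, `total_steps`, and `num_levels` are illustrative assumptions, not the schedule used by E2H Reasoner.

```python
import random


def task_probabilities(step, total_steps, num_levels):
    """Hypothetical easy-to-hard schedule (an assumption, not the paper's).

    Returns one sampling probability per difficulty level, ordered from
    easiest (index 0) to hardest (index num_levels - 1).
    Assumes total_steps > 0 and num_levels >= 1.
    """
    # Training progress clipped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # The curriculum "focus" slides from the easiest to the hardest level.
    focus = progress * (num_levels - 1)
    # Triangular weights: only levels within distance 1 of the focus are
    # sampled, so easy levels drop to zero probability later in training.
    weights = [max(0.0, 1.0 - abs(level - focus)) for level in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_level(step, total_steps, num_levels, rng=random):
    """Draw the difficulty level for the next training task."""
    probs = task_probabilities(step, total_steps, num_levels)
    return rng.choices(range(num_levels), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Print the sampling distribution at a few points in training.
    for step in (0, 25, 50, 75, 100):
        print(step, task_probabilities(step, 100, 3))
```

With three levels, this schedule samples only easy tasks at the start, only hard tasks at the end, and mixes adjacent levels in between. A schedule that never fades out the easy levels would correspond to the overfitting risk the abstract warns about.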

