We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models such as DeepSeek-R1 have demonstrated strong reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential for preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found at https://github.com/divelab/E2H-Reasoning.
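To make the easy-to-hard scheduling idea concrete, the following Python sketch shows one hypothetical way a curriculum could be realized as difficulty-weighted task sampling during RL training, with easy tasks faded out over time. The linear fading rule, the three difficulty tiers, and the helper names here are illustrative assumptions, not the exact schedule used by E2H Reasoner.

```python
# A minimal sketch of easy-to-hard curriculum sampling for RL training data.
# NOTE: the linear fading schedule and the three-tier difficulty split below are
# illustrative assumptions, not the schedule used in the paper.
import random

def e2h_sampling_weights(step: int, total_steps: int) -> dict[str, float]:
    """Return sampling weights over difficulty tiers at a given training step.

    Early in training, easy tasks dominate; their weight is linearly faded out
    so that later training focuses on medium and hard tasks, reflecting the
    observation that keeping easy tasks too long leads to overfitting.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    easy = max(1.0 - 2.0 * progress, 0.0)   # faded out by the halfway point
    hard = progress                         # ramped up over training
    medium = 1.0                            # kept constant as a bridge
    total = easy + medium + hard
    return {"easy": easy / total, "medium": medium / total, "hard": hard / total}

def sample_task(pools: dict[str, list], step: int, total_steps: int):
    """Draw one training task according to the current curriculum weights."""
    weights = e2h_sampling_weights(step, total_steps)
    tiers = list(pools.keys())
    tier = random.choices(tiers, weights=[weights[t] for t in tiers], k=1)[0]
    return random.choice(pools[tier])

# Usage: pools map difficulty tiers to task prompts; sampled tasks feed the RL loop.
pools = {"easy": ["2+2=?"], "medium": ["Solve x^2-5x+6=0"], "hard": ["Prove ..."]}
task = sample_task(pools, step=100, total_steps=1000)
```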