Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.
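The data-selection idea for SFT described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's algorithm: the "model" is a dictionary that memorises answers, and `verify`, `teacher_demo`, and `run` are hypothetical names introduced here for illustration. It shows the selection pattern (ask the teacher only where the current model fails) and does not reproduce the exponential separation claimed in the abstract.

```python
import random


def verify(prompt: int, answer) -> bool:
    """Hypothetical cheap outcome check: is the final answer correct?"""
    return answer == prompt % 7


def teacher_demo(prompt: int) -> int:
    """Hypothetical expensive teacher that supplies a correct answer."""
    return prompt % 7


def run(prompts, rounds, batch_size, adaptive, seed=0):
    """Return (teacher_calls, accuracy) for a toy memorising model."""
    rng = random.Random(seed)
    model = {}          # toy model (assumption): prompt -> memorised answer
    teacher_calls = 0
    for _ in range(rounds):
        batch = [rng.choice(prompts) for _ in range(batch_size)]
        for p in batch:
            # Autocurriculum: skip prompts the current model already solves,
            # so teacher supervision is spent only where the model struggles.
            if adaptive and verify(p, model.get(p)):
                continue
            model[p] = teacher_demo(p)
            teacher_calls += 1
    accuracy = sum(verify(p, model.get(p)) for p in prompts) / len(prompts)
    return teacher_calls, accuracy


if __name__ == "__main__":
    prompts = list(range(50))
    print("non-adaptive:", run(prompts, 40, 10, adaptive=False))
    print("adaptive:    ", run(prompts, 40, 10, adaptive=True))
```

With the same seed both runs see the same prompts and reach the same accuracy, but the adaptive run never requests more than one demonstration per distinct prompt, whereas the non-adaptive run requests one for every sampled prompt.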

