Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.
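To make the adaptive data selection idea concrete, the following is a minimal sketch of a counterexample-driven SFT loop on a toy task. It is not the algorithm analyzed in this work. The threshold task, the helper names (`truth`, `is_correct`, `teacher_demo`, `fit_threshold`, `autocurriculum_sft`), and the availability of a cheap outcome checker are all assumptions made for illustration. The only point it shows is that teacher demonstrations are requested solely on prompts the current model answers incorrectly.

```python
import random
from typing import Callable, Dict, List, Tuple

Prompt = int
Answer = int
Model = Callable[[Prompt], Answer]

# Toy task (hypothetical): the correct answer is 1 iff prompt >= HIDDEN_T.
N = 100
HIDDEN_T = 37
PROMPTS: List[Prompt] = list(range(N))


def truth(p: Prompt) -> Answer:
    return int(p >= HIDDEN_T)


def is_correct(p: Prompt, a: Answer) -> bool:
    # Assumed cheap outcome checker: says right/wrong, gives no demonstration.
    return a == truth(p)


def teacher_demo(p: Prompt) -> Answer:
    # Assumed expensive teacher: returns the demonstrated answer for p.
    return truth(p)


def fit_threshold(demos: Dict[Prompt, Answer], n: int = N) -> Model:
    # Stand-in for fine-tuning: pick a threshold consistent with all demos,
    # halfway between the largest 0-labeled and smallest 1-labeled prompt.
    lo = max((p for p, a in demos.items() if a == 0), default=-1)
    hi = min((p for p, a in demos.items() if a == 1), default=n)
    thr = (lo + hi + 1) // 2
    return lambda p: int(p >= thr)


def autocurriculum_sft(
    prompts: List[Prompt],
    rounds: int = 200,
    batch: int = 1,
    seed: int = 0,
) -> Tuple[Model, Dict[Prompt, Answer]]:
    rng = random.Random(seed)
    demos: Dict[Prompt, Answer] = {}
    model = fit_threshold(demos)
    for _ in range(rounds):
        # Use the model's own performance to find prompts it still fails.
        failing = [p for p in prompts if not is_correct(p, model(p))]
        if not failing:
            break
        # Spend teacher supervision only on (a sample of) failing prompts.
        for p in rng.sample(failing, min(batch, len(failing))):
            demos[p] = teacher_demo(p)
        model = fit_threshold(demos)
    return model, demos


final_model, demos = autocurriculum_sft(PROMPTS)
print(f"teacher demonstrations used: {len(demos)} of {len(PROMPTS)} prompts")
```

In this toy setting every queried prompt is a counterexample to the current hypothesis, so each demonstration strictly narrows the set of consistent thresholds. A non-adaptive baseline would instead draw demonstrations without regard to where the model fails; the sketch makes no quantitative claim about the gap between the two.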