Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training and often leads to overthinking, producing redundant intermediate steps. To improve learning and reasoning efficiency while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases data complexity based on the model's proficiency across multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long-thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% over the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems.
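The tailored-curriculum idea above — advancing each RL stage with problems at the frontier of the model's current proficiency — can be sketched as follows. This is a minimal illustration, not TACLer's actual implementation: the `solve` sampler, pass-rate thresholds, and sample count are all illustrative assumptions.

```python
def stage_pass_rate(solve, problem, n_samples=4):
    """Estimate proficiency on a problem as the fraction of
    n sampled attempts that succeed (solve returns True/False)."""
    return sum(bool(solve(problem)) for _ in range(n_samples)) / n_samples

def build_stage_curriculum(solve, pool, low=0.25, high=0.75):
    """Keep problems the model sometimes, but not reliably, solves.
    Items it already masters (rate > high) or cannot yet attempt
    (rate < low) are dropped, so each stage targets knowledge the
    model lacks but can plausibly acquire next."""
    return [p for p in pool
            if low <= stage_pass_rate(solve, p) <= high]
```

Re-running this selection between stages, as the model's pass rates shift, yields a curriculum that tracks the model rather than a fixed difficulty schedule.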