Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces training time by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.
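To make the sampling mechanism described above concrete, the sketch below shows one way an adaptive curriculum could sit on top of a standard RFT loop. It is a minimal illustration under assumed details, not the paper's exact algorithm: the function names (`update_target_difficulty`, `sample_batch`), the 0-100 difficulty scale, the target reward of 0.5, and the step size are all hypothetical choices made for the example.

```python
import random

def update_target_difficulty(target, recent_rewards, target_reward=0.5,
                             step_size=50.0, min_d=0.0, max_d=100.0):
    """Shift the curriculum's target difficulty: harder when recent rewards
    exceed the target reward, easier otherwise (illustrative update rule)."""
    avg_reward = sum(recent_rewards) / len(recent_rewards)
    target = target + step_size * (avg_reward - target_reward)
    return max(min_d, min(max_d, target))

def sample_batch(problems, target, batch_size=32):
    """Select the problems whose annotated difficulty is closest to the target."""
    ranked = sorted(problems, key=lambda p: abs(p["difficulty"] - target))
    return ranked[:batch_size]

# Toy usage: difficulties on a 0-100 scale, rewards in [0, 1].
problems = [{"id": i, "difficulty": random.uniform(0, 100)} for i in range(1000)]
target = 30.0
for step in range(5):
    batch = sample_batch(problems, target)
    # ... run the RFT update (e.g. a PPO step) on `batch` and collect rewards ...
    recent_rewards = [random.random() for _ in batch]  # placeholder rewards
    target = update_target_difficulty(target, recent_rewards)
```

Because the curriculum only changes which problems are drawn into each batch, the reward function, policy-optimization step, and model architecture are untouched, which is what makes the extension lightweight.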