Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data by complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that predicts each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student, i.e., questions that are neither too easy nor too hard (the Goldilocks principle), while the student is trained with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
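The teacher's adaptive selection loop can be illustrated with a minimal sketch. All names and the specific update rule below are assumptions for illustration, not the paper's actual implementation: the teacher keeps a running estimate of the student's solve rate per question, samples questions whose estimated solve rate sits near a target (e.g., 50%), and updates its estimates from the observed GRPO rollout rewards.

```python
import random

class GoldilocksSampler:
    """Hypothetical sketch of Goldilocks-style teacher sampling (assumed API)."""

    def __init__(self, question_ids, target=0.5, band=0.25):
        # Teacher's running estimate of the student's solve rate per question,
        # initialized at the target (maximally uncertain).
        self.est = {q: 0.5 for q in question_ids}
        self.target = target  # ideal difficulty: ~50% solve rate
        self.band = band      # acceptable distance from the target

    def sample(self, k):
        # Prefer questions estimated to be neither too easy nor too hard.
        eligible = [q for q, p in self.est.items()
                    if abs(p - self.target) <= self.band]
        pool = eligible if len(eligible) >= k else list(self.est)
        return random.sample(pool, k)

    def update(self, question, solve_rate, lr=0.3):
        # Adapt the difficulty estimate from the student's observed
        # solve rate on this question (e.g., fraction of correct
        # GRPO rollouts), tracking the student's evolving ability.
        self.est[question] += lr * (solve_rate - self.est[question])
```

In a training loop, one would sample a batch, run GRPO rollouts on it, compute the per-question solve rate from the rollout rewards, and call `update` so that questions the student has mastered drift out of the eligible pool.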