Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework. Code available at: https://github.com/SakanaAI/RLT
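To make the training signal concrete, below is a minimal sketch (not the authors' implementation) of the dense reward described above: the teacher's explanation is scored by how well the student "understands" the ground-truth solution, approximated here by the student's average log-probability of the solution tokens conditioned on the question and the explanation. The student model name, prompt template, and length normalization are illustrative assumptions; see the linked repository for the full pipeline.

```python
# Minimal sketch of an RLT-style dense reward, under the assumptions stated above:
# score the teacher's explanation by how likely the *student* finds the known solution
# when conditioned on the question plus that explanation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT_NAME = "Qwen/Qwen2.5-0.5B"  # placeholder student model (assumption)
tokenizer = AutoTokenizer.from_pretrained(STUDENT_NAME)
student = AutoModelForCausalLM.from_pretrained(STUDENT_NAME)
student.eval()


@torch.no_grad()
def student_understanding_reward(question: str, explanation: str, solution: str) -> float:
    """Average log-probability the student assigns to the solution tokens,
    given the question and the teacher's explanation (hypothetical prompt format)."""
    context = f"Question: {question}\nExplanation: {explanation}\nSolution: "
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    solution_ids = tokenizer(solution, return_tensors="pt",
                             add_special_tokens=False).input_ids

    # Run the student on context + solution in one forward pass.
    input_ids = torch.cat([context_ids, solution_ids], dim=1)
    logits = student(input_ids).logits

    # Score only the solution tokens: logits at position t predict token t+1.
    start = context_ids.shape[1]
    log_probs = torch.log_softmax(logits[:, start - 1:-1, :], dim=-1)
    token_log_probs = log_probs.gather(-1, solution_ids.unsqueeze(-1)).squeeze(-1)

    # Length-normalized log-likelihood as a dense scalar reward for the teacher.
    return token_log_probs.mean().item()
```

In an RL loop over the teacher, this scalar would be computed for each sampled explanation and used as its reward, giving a dense signal without requiring the teacher to discover correct solutions through exploration.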