Designing effective reasoning-capable LLMs typically requires training with Reinforcement Learning with Verifiable Rewards (RLVR) or distillation from carefully curated Long Chains of Thought (CoTs), both of which depend heavily on extensive training data. This poses a major challenge when quality training data is scarce. We propose a sample-efficient, two-stage training strategy for developing reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely Knights \& Knaves (K\&K) logic puzzles, to instill general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: $(i)$ the warmup phase alone facilitates generalized reasoning, yielding performance improvements across a range of tasks, including MATH, HumanEval$^{+}$, and MMLU-Pro; $(ii)$ when both the base model and the warmed-up model are RLVR-trained on the same small dataset ($\leq100$ examples), the warmed-up model consistently outperforms the base model; $(iii)$ warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; and $(iv)$ introducing warmup into the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. These results highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.