Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, training remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods, such as parallelization, objective- and data-driven modifications, and replay buffers, either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share long overlapping segments, wasting computation on regenerating them. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including AIME24, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at https://github.com/ShopeeLLM/Spec-RL.
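For intuition, the sketch below illustrates the general draft-and-verify reuse idea with toy categorical policies over a tiny vocabulary: a cached rollout from the previous epoch is replayed token by token under the current policy, each cached token is accepted with the standard speculative-decoding probability min(1, p_new/p_old), and fresh generation resumes from the first rejected position. This is a minimal, self-contained sketch under those assumptions; the names (`speculative_rollout`, `make_policy`) and the residual-resampling detail are illustrative, not the SPEC-RL implementation.

```python
# Minimal sketch of speculative rollout reuse with toy policies.
# All names and details here are illustrative assumptions, not SPEC-RL code.
import random

VOCAB = list(range(8))   # toy vocabulary
EOS = 7                  # toy end-of-sequence token
MAX_LEN = 32

def make_policy(seed):
    """Return a toy stochastic policy: context -> distribution over VOCAB."""
    rng, table = random.Random(seed), {}
    def probs(context):
        key = tuple(context[-2:])               # tiny context window
        if key not in table:
            w = [rng.random() + 0.1 for _ in VOCAB]
            table[key] = [x / sum(w) for x in w]
        return table[key]
    return probs

def sample(dist, rng):
    """Draw a token index from a categorical distribution."""
    r, acc = rng.random(), 0.0
    for tok, p in enumerate(dist):
        acc += p
        if r < acc:
            return tok
    return len(dist) - 1

def residual_sample(p_new, p_old, rng):
    """Sample from the normalized positive part of (p_new - p_old)."""
    res = [max(a - b, 0.0) for a, b in zip(p_new, p_old)]
    s = sum(res)
    return sample(p_new, rng) if s == 0.0 else sample([x / s for x in res], rng)

def speculative_rollout(prev_traj, old_policy, new_policy, rng):
    """Reuse prev_traj as a draft prefix, verify it against the current
    policy, and resume fresh generation from the first rejected position."""
    traj = []
    for tok in prev_traj:
        p_new, p_old = new_policy(traj), old_policy(traj)
        if rng.random() < min(1.0, p_new[tok] / p_old[tok]):
            traj.append(tok)                              # accept cached token
        else:
            traj.append(residual_sample(p_new, p_old, rng))  # correct rejected position
            break
    while len(traj) < MAX_LEN and (not traj or traj[-1] != EOS):
        traj.append(sample(new_policy(traj), rng))        # continue with current policy
    return traj

if __name__ == "__main__":
    rng = random.Random(0)
    old_policy, new_policy = make_policy(1), make_policy(2)
    # "previous epoch" rollout generated by the old policy
    prev = []
    while len(prev) < MAX_LEN and (not prev or prev[-1] != EOS):
        prev.append(sample(old_policy(prev), rng))
    new = speculative_rollout(prev, old_policy, new_policy, rng)
    shared = next((i for i, (a, b) in enumerate(zip(prev, new)) if a != b),
                  min(len(prev), len(new)))
    print(f"reused {shared} of {len(prev)} cached tokens; new rollout length {len(new)}")
```

The acceptance test plus residual resampling keeps the verified prefix distributed as if it had been sampled from the current policy, which is what lets cached segments be reused without biasing the rollout distribution.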