Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods, such as parallelization, objective- and data-driven modifications, and replay buffers, either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, wasting computation. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including AIME24, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scaling RLVR for large reasoning models. Our code is available at https://github.com/ShopeeLLM/Spec-RL.
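To make the draft-and-verify idea concrete, below is a minimal, illustrative Python sketch of reusing a cached rollout from a previous epoch as a speculative prefix under the current policy. It is not the released implementation; the helper names (`current_logprob_fn`, `continue_fn`) and the simple rejection-sampling acceptance test are assumptions introduced here for illustration only.

```python
import math
import random


def speculative_rollout(old_tokens, old_logprobs, current_logprob_fn, continue_fn):
    """Toy sketch: reuse a cached trajectory as a speculative prefix.

    old_tokens:         tokens generated for this prompt in a previous epoch
    old_logprobs:       per-token log-probs recorded under the old policy
    current_logprob_fn: (prefix, token) -> log-prob of `token` under the current policy
    continue_fn:        (prefix) -> fresh continuation sampled from the current policy
    """
    accepted = []
    for tok, old_lp in zip(old_tokens, old_logprobs):
        new_lp = current_logprob_fn(accepted, tok)
        # Speculative-decoding-style acceptance test: keep the cached token with
        # probability min(1, p_new / p_old); otherwise stop and resume fresh decoding.
        if random.random() < min(1.0, math.exp(new_lp - old_lp)):
            accepted.append(tok)
        else:
            break
    # Only the rejected suffix is regenerated, so unchanged prefixes cost no new decoding.
    return accepted + continue_fn(accepted)
```

The intended effect is that when consecutive epochs would produce largely overlapping trajectories, most tokens pass verification and only the divergent tail is re-decoded, which is where the reported rollout-time savings would come from.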