Reinforcement learning (RL) has become a key component of modern large language models (LLMs), yet the rollout stage remains the main bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural way to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, which limits the achievable speedup. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes for integrating MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by fluctuations in model entropy and shows a clear negative linear relationship with the rise of entropy during the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL, compared with greedy draft sampling. We further find that the conventional MTP training objectives (cross-entropy or KL divergence) are suboptimal in such settings, and we therefore propose a novel end-to-end (e2e) total variation (TV) loss that directly optimizes the multi-step rejection sampling acceptance rate. This loss improves acceptance rates by ~10%, reaching up to 95% acceptance and up to 25% additional inference throughput across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with the e2e TV loss and rejection sampling maintains a consistent acceptance rate and speedup throughout RL training, eliminating the need for costly online MTP updates. Extensive experiments and analysis validate these findings: our method achieves up to 1.8x end-to-end acceleration in asynchronous RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
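To make the link between rejection sampling and a TV objective concrete, the following minimal sketch illustrates a standard property of speculative decoding: when a draft token is sampled from a draft distribution q and verified against a target distribution p by rejection sampling, the expected single-token acceptance probability is the sum of min(p(x), q(x)) over tokens, which equals 1 - TV(p, q). The function names, the toy distributions, and the assumption of a fixed per-position acceptance rate are all illustrative and hypothetical; this is not the Bebop loss itself, whose exact form is not given in the abstract.

```python
from typing import Sequence


def tv_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def acceptance_rate(p: Sequence[float], q: Sequence[float]) -> float:
    """Expected probability that a token drawn from draft q is accepted
    by rejection sampling against target p: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def expected_accepted_tokens(alphas: Sequence[float]) -> float:
    """Expected number of accepted draft tokens in one speculative step.

    Simplifying assumption (ours): position k has a fixed acceptance
    rate alphas[k], and drafting stops at the first rejection.
    """
    total, survive = 0.0, 1.0
    for alpha in alphas:
        survive *= alpha   # probability that all tokens up to this one are accepted
        total += survive
    return total


if __name__ == "__main__":
    target = [0.7, 0.2, 0.1]  # toy target distribution p (hypothetical)
    draft = [0.5, 0.4, 0.1]   # toy draft (MTP head) distribution q (hypothetical)
    alpha = acceptance_rate(target, draft)
    print("acceptance rate:", alpha)
    print("1 - TV(p, q):  ", 1.0 - tv_distance(target, draft))
    print("expected accepted tokens (3 draft steps):",
          expected_accepted_tokens([alpha] * 3))
```

Under these assumptions, lowering the TV distance between draft and target raises the acceptance rate one-for-one. This is one plausible reason a TV-style objective could track acceptance more directly than cross-entropy or KL.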