The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically with Reinforcement Learning (RL), encounters a critical efficiency bottleneck: response generation during RL training exhibits a persistent long-tail distribution, in which a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that losslessly accelerates reasoning RL training by integrating adaptive speculative decoding (SD). Applying speculative decoding in RL is challenging because of dynamic workloads, an evolving target model, and the overhead of training a draft model. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects a suitable SD strategy for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
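To make the "lossless" claim concrete, the following is a minimal sketch of speculative decoding in its simplest, greedy form. It is not TLT's implementation: the toy models `toy_target` and `toy_draft`, the function names, and the draft length `k` are all hypothetical, and RL rollouts typically sample rather than decode greedily, which requires a rejection-sampling acceptance rule not shown here. The sketch only illustrates why verifying draft tokens against the target model leaves the output unchanged while reducing the number of target verification steps.

```python
from typing import Callable, List, Tuple

Token = int
# A "model" here is just a deterministic greedy next-token function (assumption).
NextToken = Callable[[List[Token]], Token]


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(model(seq))
    return seq[len(prompt):]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], max_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    seq, out, rounds = list(prompt), [], 0
    while len(out) < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        ctx, proposal = list(seq), []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2) The target verifies the proposal. A real system scores all k + 1
        #    positions in one batched forward pass; here it is a plain loop.
        rounds += 1
        ctx, accepted = list(seq), []
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                accepted.append(expected)  # replace the first mismatch, then stop
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))  # all k accepted: keep the bonus token

        accepted = accepted[: max_new - len(out)]
        out.extend(accepted)
        seq.extend(accepted)
    return out, rounds


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model."""
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[Token]) -> Token:
    """Hypothetical draft: agrees with the target except at every fifth position."""
    tok = toy_target(seq)
    return tok if len(seq) % 5 else (tok + 1) % 11


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, rounds = speculative_decode(toy_target, toy_draft, prompt, max_new=20)
    assert tokens == greedy_decode(toy_target, prompt, max_new=20)
    print(f"{len(tokens)} tokens in {rounds} verification rounds")
```

Every emitted token is either confirmed or produced by the target model, so the output matches plain greedy decoding regardless of draft quality. A better-aligned draft only raises the number of tokens accepted per round, which is presumably why keeping the drafter aligned with the evolving target matters for the speedup.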