The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically via Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding (SD). Applying SD in RL is challenging due to dynamic workloads, the continuously evolving target model, and the overhead of draft model training. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects a suitable SD strategy for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
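As background for the lossless-acceleration claim, the following is a minimal toy sketch of greedy speculative decoding: a cheap draft model proposes a block of tokens, the target model verifies them, and the longest agreeing prefix is accepted before the target emits one corrected token. The `target`/`draft` callables and parameter names here are illustrative stand-ins, not TLT's actual implementation.

```python
def greedy_decode(model, prefix, steps):
    """Reference decoding: one target-model call per generated token."""
    out = list(prefix)
    for _ in range(steps):
        out.append(model(out))
    return out[len(prefix):]

def speculative_decode(target, draft, prefix, steps, k=4):
    """Toy greedy speculative decoding.

    The draft proposes k tokens autoregressively; the target verifies
    them, accepting tokens only while they match its own greedy choice,
    then emits one token itself (a correction on mismatch, or a bonus
    token on full acceptance). Under greedy verification the output is
    identical to greedy_decode with the target alone -- the "lossless"
    property -- regardless of draft quality; draft quality only affects
    how many tokens are accepted per target call.
    """
    out = list(prefix)
    while len(out) - len(prefix) < steps:
        # Draft proposes k tokens.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept while the proposal matches its choice.
        for t in proposal:
            if target(out) == t:
                out.append(t)
            else:
                break
        # Target emits one token (correction or bonus).
        out.append(target(out))
    return out[len(prefix) : len(prefix) + steps]
```

With deterministic toy "models" (functions from a token context to the next token id), the speculative output matches plain greedy decoding exactly, even when the draft frequently disagrees with the target.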