Since the introduction of the GRPO algorithm, reinforcement learning~(RL) has attracted increasing attention for LLM post-training, yet training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are co-located on the same devices and executed synchronously, so the two phases cannot run concurrently. In this work, we revisit the strategy of deploying inference and training separately, and propose a \emph{periodically asynchronous} framework that transforms synchronous RL training into an asynchronous producer--consumer pipeline. Because model weights are synchronised at the beginning of each training iteration and all rollouts are generated from the same policy, the proposed framework remains inherently \emph{on-policy}: it avoids the off-policy bias introduced by existing asynchronous approaches and requires no modification to standard RL algorithms. We further introduce a unified tri-model architecture and a shared-prompt attention mechanism to support efficient asynchronous execution and reduce redundant computation. Experiments on NPU platforms show that the proposed framework achieves around $2\times$ throughput improvement from asynchronous execution, with additional gains from system-level optimisations, substantially outperforming mainstream RL frameworks in end-to-end training throughput while maintaining comparable accuracy. Further validation on GPU platforms confirms that the proposed framework generalises effectively across hardware architectures, indicating its potential for widespread application.
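
The periodically asynchronous schedule described above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the framework's implementation. Policy weights are reduced to an integer version counter, and the names `produce_rollouts` and `run_periodic_async` are hypothetical. The sketch also assumes a single optimiser step per iteration, applied only after every rollout of that iteration has been consumed; the abstract does not specify this detail. The point it shows is that the consumer can start working while the producer is still generating, yet every rollout it sees was produced by the policy synchronised at the start of the iteration.

```python
import queue
import threading


def produce_rollouts(out_queue, policy_version, num_prompts):
    """Producer (inference side): generate all rollouts of one iteration
    from the same synchronised policy version. Hypothetical helper."""
    for prompt_id in range(num_prompts):
        out_queue.put((prompt_id, policy_version))
    out_queue.put(None)  # sentinel: this iteration's rollouts are complete


def run_periodic_async(num_iterations, prompts_per_iteration):
    """Toy periodically asynchronous loop.

    Returns a log of (rollout_version, trainer_version) pairs, one per
    consumed rollout, so on-policy behaviour can be checked afterwards.
    """
    trainer_version = 0  # stands in for the trainer's policy weights
    log = []

    for _ in range(num_iterations):
        # Periodic synchronisation: copy trainer weights to the inference side.
        inference_version = trainer_version
        rollouts = queue.Queue()

        producer = threading.Thread(
            target=produce_rollouts,
            args=(rollouts, inference_version, prompts_per_iteration),
        )
        producer.start()

        # Consumer (training side): process rollouts as soon as they arrive,
        # overlapping with generation of the remaining rollouts.
        while True:
            item = rollouts.get()
            if item is None:
                break
            _, rollout_version = item
            log.append((rollout_version, trainer_version))
        producer.join()

        # Assumption: one weight update per iteration, after all rollouts
        # of the iteration have been consumed.
        trainer_version += 1

    return log


if __name__ == "__main__":
    for rollout_version, trainer_version in run_periodic_async(3, 4):
        print(rollout_version, trainer_version)
```

In this sketch the two version numbers in every logged pair are equal, which is the on-policy property: asynchrony exists only within an iteration, never across a weight update.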