Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. Mainstream RL frameworks typically colocate inference and training on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution couples the two workloads computationally and prevents inference and training from running concurrently. In this work, we return to the strategy of deploying inference and training separately and, by introducing improvements to the data loader, transform the conventional synchronous architecture into a periodically asynchronous framework. This design allows each component to scale independently and elastically on demand, while the algorithm remains exactly equivalent in accuracy to the synchronous method: both are on-policy. Notably, we adopt a unified tri-model architecture in the training phase and propose a shared-prompt attention mask to reduce repetitive computation. In practice, our approach consistently delivers significant end-to-end training efficiency improvements on NPU platforms, indicating its potential for widespread application.
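In GRPO-style training, the group of responses sampled for one prompt all share that prompt, so a shared-prompt attention mask can let the packed responses reuse a single copy of the prompt instead of recomputing it per response. The sketch below is a minimal illustration of that idea under our own assumptions (function name, packing layout, and NumPy representation are illustrative, not the paper's implementation): each response attends causally to itself and to the full shared prompt, but never to sibling responses.

```python
import numpy as np

def shared_prompt_mask(prompt_len, resp_lens):
    """Boolean attention mask for a packed sequence [prompt, r1, r2, ...].

    Hypothetical sketch: prompt tokens attend causally among themselves;
    each response attends to the whole shared prompt and causally within
    itself, but not to any other response in the pack.
    """
    total = prompt_len + sum(resp_lens)
    mask = np.zeros((total, total), dtype=bool)
    # Causal (lower-triangular) attention within the shared prompt.
    mask[:prompt_len, :prompt_len] = np.tril(
        np.ones((prompt_len, prompt_len), dtype=bool))
    start = prompt_len
    for rl in resp_lens:
        end = start + rl
        # Every response token sees the entire shared prompt...
        mask[start:end, :prompt_len] = True
        # ...and earlier tokens of its own response only.
        mask[start:end, start:end] = np.tril(np.ones((rl, rl), dtype=bool))
        start = end
    return mask
```

With this layout, the prompt's key/value computation appears once in the packed batch regardless of how many responses the group contains, which is the source of the claimed savings.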