Reinforcement learning (RL) has demonstrated immense potential in advancing artificial general intelligence, agentic intelligence, and embodied intelligence. However, the inherent heterogeneity and dynamism of RL workflows often lead to low hardware utilization and slow training on existing systems. In this paper, we present RLinf, a high-performance RL training system based on our key observation that the major roadblock to efficient RL training lies in insufficient system flexibility. To maximize flexibility and efficiency, RLinf is built atop a novel RL system design paradigm called macro-to-micro flow transformation (M2Flow), which automatically breaks down high-level, easy-to-compose RL workflows along both the temporal and spatial dimensions, and recomposes them into optimized execution flows. Supported by the adaptive communication capability of RLinf workers, we devise context switching and elastic pipelining to realize the M2Flow transformation, and a profiling-guided scheduling policy to generate optimal execution plans. Extensive evaluations on both reasoning RL and embodied RL tasks demonstrate that RLinf consistently outperforms state-of-the-art systems, achieving a $1.07\times$–$2.43\times$ speedup in end-to-end training throughput.
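To make the macro-to-micro idea concrete, the following is a minimal conceptual sketch (not RLinf's actual API; all names such as `MicroTask`, `decompose`, and `recompose` are illustrative assumptions). A macro-level workflow of stages is split temporally into data chunks and spatially across workers, and the resulting micro-tasks are recomposed into a pipelined schedule in which a later stage can start on chunk $k$ as soon as the earlier stage finishes it.

```python
# Conceptual sketch of macro-to-micro flow transformation.
# Hypothetical names for illustration only, not RLinf's implementation.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MicroTask:
    stage: str        # logical stage this micro-task belongs to (e.g. "rollout")
    chunk_id: int     # temporal split: index of the data chunk
    worker_id: int    # spatial split: which worker executes the task


def decompose(stages: List[str], num_chunks: int, num_workers: int) -> List[MicroTask]:
    """Break a macro workflow (e.g. rollout -> inference -> training) into
    micro-tasks along the temporal (chunks) and spatial (workers) dimensions."""
    return [
        MicroTask(stage, chunk, chunk % num_workers)
        for stage in stages
        for chunk in range(num_chunks)
    ]


def recompose(tasks: List[MicroTask], stages: List[str], num_chunks: int) -> List[List[MicroTask]]:
    """Recompose micro-tasks into a simple pipelined schedule: at time step t,
    stage s processes chunk t - s, so downstream stages no longer wait for the
    whole batch to finish upstream."""
    by_key: Dict[Tuple[str, int], MicroTask] = {(t.stage, t.chunk_id): t for t in tasks}
    schedule: List[List[MicroTask]] = []
    for step in range(num_chunks + len(stages) - 1):
        slot = []
        for s, stage in enumerate(stages):
            chunk = step - s
            if 0 <= chunk < num_chunks:
                slot.append(by_key[(stage, chunk)])
        schedule.append(slot)
    return schedule


if __name__ == "__main__":
    stages = ["rollout", "inference", "training"]
    tasks = decompose(stages, num_chunks=4, num_workers=2)
    for step, slot in enumerate(recompose(tasks, stages, num_chunks=4)):
        print(step, [(t.stage, t.chunk_id, t.worker_id) for t in slot])
```

This sketch only illustrates the decompose/recompose structure; the actual system additionally handles context switching, elastic pipelining, and profiling-guided placement decisions described above.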