Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness--throughput tradeoff. We present \textbf{Relax} (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an \emph{omni-native architecture} builds multimodal support into the full stack, from data preprocessing and modality-aware parallelism to inference generation, rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. In on-policy training of Qwen3-4B, Relax achieves a 1.20$\times$ end-to-end speedup over veRL. Its fully asynchronous mode delivers a 1.76$\times$ speedup over colocated execution on Qwen3-4B and a 2.00$\times$ speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay)~\cite{ma2025r3} for MoE models with only 1.9\% overhead, compared with a 32\% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, and sustains over 2{,}000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.