Reinforcement learning (RL) has become a critical paradigm for LLM post-training, yet the rollout phase, which accounts for 50--80% of total step time, is bottlenecked by skewed generation: long-tailed trajectories, which are indispensable for model performance, block the entire training pipeline. Asynchronous training offers a natural remedy by overlapping generation with training, but it introduces a fundamental tension between efficiency and algorithmic correctness. We identify three constraints that asynchronous training must satisfy to preserve convergence: intra-trajectory policy consistency, data integrity, and bounded staleness. Existing approaches either fail to intrinsically address the long-tailed trajectory problem, which is further exacerbated by the imbalance characteristic of Mixture-of-Experts models, or deviate from the standard RL training formulation, thereby hindering model convergence. We therefore propose DORA (Dynamic ORchestration for Asynchronous Rollout), which addresses this challenge through algorithm-system co-design. DORA introduces multi-version streaming rollout, a novel asynchronous paradigm that maintains multiple policy versions concurrently, eliminating pipeline bubbles entirely without violating these algorithmic constraints. Experimental results show that DORA achieves up to 2--3 times higher throughput than state-of-the-art systems on open-source benchmarks without compromising convergence. Furthermore, in large-scale industrial applications with tens of thousands of accelerators, DORA accelerates RL training by 2--4 times compared to synchronous training across various scenarios. The resulting open-source LongCat-Flash-Thinking models exhibit competitive performance on complex reasoning benchmarks, matching the capability of the most advanced LLMs.
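To make the three constraints concrete, the following minimal sketch shows one way an admission check on finished trajectories could look. It is an illustration under our own assumptions, not DORA's implementation: the `Trajectory` type, the `is_admissible` function, and the `max_staleness` parameter are hypothetical names, and the reading of each constraint (one policy version per trajectory, only complete trajectories are trained on, and the generating version lags the trainer by a bounded number of steps) is an interpretation of the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical record: policy version that produced each generated token.
    token_versions: List[int]
    # True once generation ended normally (not cut off mid-sequence).
    finished: bool


def is_admissible(traj: Trajectory, trainer_version: int, max_staleness: int) -> bool:
    """Return True if the trajectory may be used for a training step.

    Assumed interpretation of the three constraints:
      * data integrity: the trajectory is complete and non-empty;
      * intra-trajectory policy consistency: every token came from one version;
      * bounded staleness: that version lags the trainer by at most max_staleness.
    """
    if not traj.finished or not traj.token_versions:
        return False
    if len(set(traj.token_versions)) != 1:
        return False
    staleness = trainer_version - traj.token_versions[0]
    return 0 <= staleness <= max_staleness
```

Under this sketch, a trajectory whose tokens were produced partly by version 3 and partly by version 4 is rejected even when both versions are within the staleness bound, which is the case that keeping several policy versions alive concurrently is meant to avoid.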