Despite algorithm-level innovations in multi-agent reinforcement learning (MARL), the networked infrastructure underlying large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the system-level challenges unique to MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end-to-end training framework that holistically optimizes rollout, training, and their orchestration for large-scale LLM-based MARL. Specifically, FlexMARL introduces a joint orchestrator to manage data flow under a rollout-training disaggregated architecture. Building upon an experience store, a novel micro-batch-driven asynchronous pipeline eliminates synchronization barriers while providing strong consistency guarantees. The rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, adapting to skewed inter- and intra-agent request patterns. The training engine achieves on-demand hardware binding through agent-centric resource allocation, and the training states of different agents are swapped via unified, location-agnostic communication. Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x over existing frameworks.