Multi-Agent Reinforcement Learning (MARL) has achieved significant success in large-scale AI systems and big-data applications such as smart grids and surveillance. Existing advances in MARL algorithms focus on improving the obtained rewards by introducing various mechanisms for inter-agent cooperation. However, these optimizations are usually compute- and memory-intensive, which leads to poor speed performance in terms of end-to-end training time. In this work, we analyze speed performance (i.e., latency-bounded throughput) as the key metric for MARL implementations. Specifically, we first introduce a taxonomy of MARL algorithms from an acceleration perspective, categorized by (1) training scheme and (2) communication method. Using this taxonomy, we identify three state-of-the-art MARL algorithms as target benchmarks: Multi-Agent Deep Deterministic Policy Gradient (MADDPG), Target-oriented Multi-agent Communication and Cooperation (ToM2C), and Networked Multi-Agent RL (NeurComm). We then provide a systematic analysis of their performance bottlenecks on a homogeneous multi-core CPU platform. Finally, we justify the need for latency-bounded throughput to become a key performance metric for MARL in future literature, and we identify opportunities for parallelization and acceleration.