The rise of multi-agent systems, especially the success of multi-agent reinforcement learning (MARL), is reshaping our future across diverse domains such as autonomous vehicle networks. However, MARL still faces significant challenges, particularly in achieving zero-shot scalability, i.e., the ability of trained models to be applied directly to unseen tasks with varying numbers of agents. In addition, real-world multi-agent systems usually contain agents with different functions and strategies, whereas existing scalable MARL methods exhibit only limited heterogeneity. To address this, we propose a novel MARL framework named Scalable and Heterogeneous Proximal Policy Optimization (SHPPO), which integrates heterogeneity into parameter-shared PPO-based MARL networks. We first leverage a latent network to adaptively learn strategy patterns for each agent. Second, we introduce a heterogeneous decision-making layer whose parameters are generated from the learned latent variables. Our approach is scalable, as all parameters are shared except those of the heterogeneous layer, and it achieves inter-individual and temporal heterogeneity simultaneously. We instantiate our approach on a state-of-the-art PPO-based backbone as SHPPO, but it is backbone-agnostic and can be seamlessly plugged into any parameter-shared MARL method. SHPPO outperforms baselines such as MAPPO and HAPPO in classic MARL environments, including the StarCraft Multi-Agent Challenge (SMAC) and Google Research Football (GRF), demonstrating enhanced zero-shot scalability, and visualizations of the learned latent representation offer insights into its impact on team performance.
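To make the architecture concrete, below is a minimal PyTorch sketch of the hypernetwork-style design described above: a shared latent network infers a per-agent latent variable, which is then decoded into the weights of that agent's heterogeneous decision layer. All names (`LatentNet`, `HeteroLayer`) and dimensions are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch, assuming a hypernetwork-style heterogeneous layer: a shared latent
# network produces per-agent latents, and shared generators decode each latent
# into that agent's decision-layer weights. Names are hypothetical.
import torch
import torch.nn as nn


class LatentNet(nn.Module):
    """Shared network mapping each agent's observation to a latent strategy vector."""

    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)


class HeteroLayer(nn.Module):
    """Decision layer whose weights are generated from the agent's latent vector."""

    def __init__(self, latent_dim: int, in_dim: int, out_dim: int):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Shared generators: latent -> flattened weight matrix and bias.
        self.weight_gen = nn.Linear(latent_dim, in_dim * out_dim)
        self.bias_gen = nn.Linear(latent_dim, out_dim)

    def forward(self, h: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # h: (n_agents, in_dim) hidden features; z: (n_agents, latent_dim) latents.
        W = self.weight_gen(z).view(-1, self.out_dim, self.in_dim)
        b = self.bias_gen(z)
        # Per-agent affine map: each agent acts through its own generated weights.
        return torch.bmm(W, h.unsqueeze(-1)).squeeze(-1) + b


# Usage: the same shared modules handle any number of agents.
obs_dim, hid_dim, latent_dim, n_actions, n_agents = 32, 64, 8, 5, 3
latent_net = LatentNet(obs_dim, latent_dim)
hetero = HeteroLayer(latent_dim, hid_dim, n_actions)
obs = torch.randn(n_agents, obs_dim)
hidden = torch.randn(n_agents, hid_dim)  # stands in for the shared backbone output
logits = hetero(hidden, latent_net(obs))
print(logits.shape)  # torch.Size([3, 5])
```

Because the latent network and the weight generators are themselves parameter-shared, only the generated per-agent weights differ across agents; under this reading, heterogeneity is added without sacrificing the zero-shot scalability of the shared backbone.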