The emergence of multi-agent reinforcement learning (MARL) is significantly transforming fields such as autonomous vehicle networks. However, real-world multi-agent systems typically contain multiple roles, and their scale fluctuates dynamically. Consequently, achieving zero-shot scalable collaboration requires that the strategies for different roles be updated flexibly according to scale, which remains a challenge for current MARL frameworks. To address this, we propose a novel MARL framework named Scalable and Heterogeneous Proximal Policy Optimization (SHPPO), which integrates heterogeneity into parameter-shared PPO-based MARL networks. We first leverage a latent network to adaptively learn strategy patterns for each agent. Second, we introduce a heterogeneous layer inserted into the decision-making networks, whose parameters are generated specifically from the learned latent variables. Our approach is scalable, as all parameters are shared except those of the heterogeneous layer, and it gains both inter-individual and temporal heterogeneity, allowing SHPPO to adapt effectively to varying scales. SHPPO exhibits superior performance in classic MARL environments such as the StarCraft Multi-Agent Challenge (SMAC) and Google Research Football (GRF), showcasing enhanced zero-shot scalability and offering insights, through visualization, into the learned latent variables' impact on team performance.
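The core architectural idea (a shared latent network producing per-agent latent variables that in turn generate the weights of one heterogeneous layer, while every other parameter is shared) can be illustrated with a minimal sketch. This is a hypothetical, hypernetwork-style toy implementation under assumed dimensions and names (`act_logits`, `W_latent`, `W_hyper`, `W_enc`), not the authors' actual SHPPO code:

```python
import numpy as np

# Hypothetical sketch: a shared latent network maps each agent's observation
# to a latent variable z; a hypernetwork turns z into the weights of a single
# "heterogeneous layer"; all other parameters are shared across agents.

rng = np.random.default_rng(0)

OBS_DIM, LATENT_DIM, HIDDEN_DIM, ACT_DIM = 8, 4, 16, 5  # assumed sizes

# Shared parameters (identical for every agent, so the team can scale).
W_latent = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1            # latent network
W_hyper = rng.normal(size=(LATENT_DIM, HIDDEN_DIM * ACT_DIM)) * 0.1  # hypernetwork
W_enc = rng.normal(size=(OBS_DIM, HIDDEN_DIM)) * 0.1               # shared encoder

def act_logits(obs: np.ndarray) -> np.ndarray:
    """Per-agent forward pass: shared encoder, then a generated heterogeneous layer."""
    z = np.tanh(obs @ W_latent)                          # agent-specific latent variable
    W_het = (z @ W_hyper).reshape(HIDDEN_DIM, ACT_DIM)   # generated heterogeneous weights
    h = np.tanh(obs @ W_enc)                             # shared representation
    return h @ W_het                                     # heterogeneous decision layer

# The same shared parameters serve any team size (zero-shot scaling):
for n_agents in (3, 10):
    team_obs = rng.normal(size=(n_agents, OBS_DIM))
    logits = np.stack([act_logits(o) for o in team_obs])
    assert logits.shape == (n_agents, ACT_DIM)
```

Because the latent variable `z` is recomputed from the current observation at every step, the generated layer differs both across agents (inter-individual heterogeneity) and across time steps (temporal heterogeneity), while the parameter count stays independent of team size.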