In multi-agent reinforcement learning (MARL), self-interested agents attempt to establish an equilibrium and achieve coordination that depends on the game structure. However, existing MARL approaches are mostly bound by the simultaneous actions of all agents in the Markov game (MG) framework, and few works consider the formation of equilibrium strategies via asynchronous action coordination. In view of the advantages of the Stackelberg equilibrium (SE) over the Nash equilibrium, we construct a spatio-temporal sequential decision-making structure derived from the MG and propose an N-level policy model based on a conditional hypernetwork shared by all agents. This approach allows for asymmetric training with symmetric execution, with each agent responding optimally conditioned on the decisions made by superior agents. Agents can learn heterogeneous SE policies while still maintaining parameter sharing, which reduces learning and storage costs and improves scalability as the number of agents increases. Experiments demonstrate that our method effectively converges to the SE policies in repeated matrix game scenarios and performs well in considerably more complex settings, including cooperative and mixed tasks.
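To make the contrast between the Stackelberg and Nash equilibria concrete, the following minimal sketch solves a one-shot, two-player matrix game by backward induction: the follower best-responds to each leader action, and the leader commits to the action whose induced response gives it the highest payoff. This is only an illustration of the equilibrium concept, not the N-level hypernetwork policy model described above. The payoff table `GAME`, the function names, and the tie-breaking rule (lowest action index) are all assumptions introduced for this example.

```python
from typing import List, Tuple

# payoffs[a_leader][a_follower] = (leader_payoff, follower_payoff)
Payoffs = List[List[Tuple[float, float]]]

# Hypothetical 2x2 game, chosen so that the Stackelberg and Nash outcomes differ.
GAME: Payoffs = [
    [(2.0, 1.0), (4.0, 0.0)],  # leader action 0
    [(1.0, 0.0), (3.0, 1.0)],  # leader action 1
]


def follower_best_response(payoffs: Payoffs, a_leader: int) -> int:
    """Follower action maximizing the follower's payoff given the leader's action.

    Assumption: ties are broken in favor of the lowest action index.
    """
    row = payoffs[a_leader]
    return max(range(len(row)), key=lambda a_f: row[a_f][1])


def stackelberg_equilibrium(payoffs: Payoffs) -> Tuple[int, int]:
    """Pure-strategy Stackelberg outcome found by backward induction."""
    best_pair = None
    best_value = float("-inf")
    for a_l in range(len(payoffs)):
        a_f = follower_best_response(payoffs, a_l)
        leader_value = payoffs[a_l][a_f][0]
        if leader_value > best_value:
            best_pair, best_value = (a_l, a_f), leader_value
    return best_pair


def pure_nash_equilibria(payoffs: Payoffs) -> List[Tuple[int, int]]:
    """All pure-strategy Nash equilibria of the simultaneous-move game."""
    n_l, n_f = len(payoffs), len(payoffs[0])
    equilibria = []
    for a_l in range(n_l):
        for a_f in range(n_f):
            u_l, u_f = payoffs[a_l][a_f]
            # Neither player can gain by deviating unilaterally.
            leader_ok = all(payoffs[b][a_f][0] <= u_l for b in range(n_l))
            follower_ok = all(payoffs[a_l][b][1] <= u_f for b in range(n_f))
            if leader_ok and follower_ok:
                equilibria.append((a_l, a_f))
    return equilibria


if __name__ == "__main__":
    print("Stackelberg:", stackelberg_equilibrium(GAME))
    print("Pure Nash:  ", pure_nash_equilibria(GAME))
```

In this hypothetical game, the only pure Nash equilibrium is the action pair (0, 0), which pays the leader 2, whereas committing first to action 1 induces the follower to respond with action 1 and pays the leader 3. An N-level structure would, roughly speaking, chain this kind of conditioning across more than two agents.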