In multi-agent reinforcement learning (MARL), self-interested agents attempt to reach an equilibrium and achieve coordination, depending on the game structure. However, most existing MARL approaches are constrained by the Markov game (MG) framework, in which all agents act simultaneously, and few works consider forming equilibrium strategies through asynchronous action coordination. Given the advantages of the Stackelberg equilibrium (SE) over the Nash equilibrium, we construct a spatio-temporal sequential decision-making structure derived from the MG and propose an N-level policy model based on a conditional hypernetwork shared by all agents. This approach allows for asymmetric training with symmetric execution, with each agent responding optimally conditioned on the decisions made by superior agents. Agents can learn heterogeneous SE policies while still sharing parameters, which reduces learning and storage costs and improves scalability as the number of agents increases. Experiments demonstrate that our method converges to the SE policies in repeated matrix game scenarios and performs well in much more complex settings, including cooperative and mixed tasks.
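To make the contrast between the Stackelberg and Nash equilibria concrete, the following is a minimal sketch on a hypothetical 2x2 matrix game. It is not the N-level hypernetwork policy model described above. The payoff tables and all function names are assumptions chosen for illustration, only pure strategies are considered, and follower ties are broken toward the lowest action index.

```python
# Hypothetical payoff tables, indexed as [leader_action][follower_action].
# Leader actions: 0 = U, 1 = D. Follower actions: 0 = L, 1 = R.
LEADER_PAYOFF = [[2, 4], [1, 3]]
FOLLOWER_PAYOFF = [[1, 0], [0, 1]]


def follower_best_response(follower_payoff, leader_action):
    """Follower observes the leader's action and maximizes its own payoff.

    Ties are broken toward the lowest action index (an assumption).
    """
    row = follower_payoff[leader_action]
    return max(range(len(row)), key=lambda b: row[b])


def stackelberg_equilibrium(leader_payoff, follower_payoff):
    """Pure-strategy SE: the leader commits first, anticipating the follower's reply."""
    best = None
    for a in range(len(leader_payoff)):
        b = follower_best_response(follower_payoff, a)
        value = leader_payoff[a][b]
        if best is None or value > best[2]:
            best = (a, b, value)
    return best  # (leader_action, follower_action, leader_value)


def pure_nash_equilibria(leader_payoff, follower_payoff):
    """All pure-strategy profiles where neither player gains by deviating alone."""
    n_a, n_b = len(leader_payoff), len(leader_payoff[0])
    equilibria = []
    for a in range(n_a):
        for b in range(n_b):
            leader_ok = all(
                leader_payoff[a][b] >= leader_payoff[a2][b] for a2 in range(n_a)
            )
            follower_ok = all(
                follower_payoff[a][b] >= follower_payoff[a][b2] for b2 in range(n_b)
            )
            if leader_ok and follower_ok:
                equilibria.append((a, b))
    return equilibria


if __name__ == "__main__":
    print("Pure Nash equilibria:", pure_nash_equilibria(LEADER_PAYOFF, FOLLOWER_PAYOFF))
    print("Stackelberg equilibrium:", stackelberg_equilibrium(LEADER_PAYOFF, FOLLOWER_PAYOFF))
```

In this toy game, simultaneous play has a single pure Nash equilibrium at (U, L), where the leader earns 2. When the leader commits first, it chooses D, the follower replies with R, and the leader earns 3. This is the kind of advantage that sequential, conditioned decision-making is meant to capture.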