A natural solution concept for many multiagent settings is the Stackelberg equilibrium, under which a ``leader'' agent selects a strategy that maximizes its own payoff, assuming that the ``follower'' plays a best response to this strategy. Recent work has presented asymmetric learning updates that provably converge to the \textit{differential} Stackelberg equilibria of two-player differentiable games. These updates are ``coupled'' in the sense that the leader requires some information about the follower's payoff function. Such coupled learning rules cannot be applied in \textit{ad hoc} interactive learning settings, and can be computationally impractical even in centralized training settings where the follower's payoffs are known. In this work, we present an ``uncoupled'' learning process under which each player's learning update depends only on their observations of the other player's behavior. We prove that this process converges to a local Stackelberg equilibrium under conditions similar to those required by previous coupled methods. We conclude with a discussion of the potential applications of our approach to human--AI cooperation and multi-agent reinforcement learning.