In multi-agent problems requiring a high degree of cooperation, success often depends on the ability of the agents to adapt to each other's behavior. A natural solution concept in such settings is the Stackelberg equilibrium, in which the ``leader'' agent selects the strategy that maximizes its own payoff, given that the ``follower'' agent will choose its best response to this strategy. Recent work has extended this solution concept to two-player differentiable games, such as those arising from multi-agent deep reinforcement learning, in the form of the \textit{differential} Stackelberg equilibrium. While this previous work has presented learning dynamics that converge to such equilibria, these dynamics are ``coupled'' in the sense that the learning updates for the leader's strategy require some information about the follower's payoff function. As such, these methods cannot be applied to truly decentralized multi-agent settings, particularly ad hoc cooperation, where each agent only has access to its own payoff function. In this work we present ``uncoupled'' learning dynamics based on zeroth-order gradient estimators, in which each agent's strategy update depends only on its observations of the other agent's behavior. We analyze the convergence of these dynamics in general-sum games, and prove that they converge to differential Stackelberg equilibria under the same conditions as previous coupled methods. Furthermore, we present an online mechanism by which symmetric learners can negotiate leader-follower roles. We conclude with a discussion of the implications of our work for multi-agent reinforcement learning and ad hoc cooperation more generally.
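To make the notion of an ``uncoupled'' zeroth-order update concrete, the following is a minimal sketch using a standard single-point gradient estimator from bandit optimization; the symbols $f_i$, $x_i$, $\delta$, and $\eta$ are illustrative and not taken from the paper itself. Each agent $i$ perturbs only its own strategy $x_i \in \mathbb{R}^{d_i}$ and estimates the gradient of its own payoff $f_i$ from a single observed payoff value:
\begin{equation*}
    \hat{\nabla}_{x_i} f_i(x_i, x_{-i}) \;=\; \frac{d_i}{\delta}\, f_i\!\left(x_i + \delta u_i,\, x_{-i}\right) u_i, \qquad u_i \sim \mathrm{Unif}\!\left(\mathbb{S}^{d_i - 1}\right),
\end{equation*}
followed by the ascent step $x_i \leftarrow x_i + \eta\, \hat{\nabla}_{x_i} f_i(x_i, x_{-i})$. In expectation this estimator equals the gradient of a $\delta$-smoothed version of $f_i$, and the update uses only agent $i$'s own payoff evaluations together with its observation of the other agent's current strategy $x_{-i}$, which is what makes the dynamics uncoupled; the paper's actual estimator and step rule may differ in detail.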