Social Optimum Equilibrium Selection for Distributed Multi-Agent Optimization

We study the open question of how players learn to play a social optimum pure-strategy Nash equilibrium (PSNE) through repeated interactions in general-sum coordination games. A social optimum of a game is the stable Pareto-optimal state that maximizes the sum of all players' payoffs (the social welfare); such a state always exists. We consider finite repeated games in which each player has access only to its own utility (or payoff) function but can exchange information with other players. We develop a novel regret matching (RM) based algorithm for computing an efficient PSNE solution that, in the long run, approaches the Pareto-optimal outcome with the highest social welfare among all attainable equilibria. Our learning procedure follows the regret minimization framework but extends it in three major ways: (1) agents use global, instead of local, utility to calculate regrets, (2) each agent maintains a small, diminishing exploration probability in order to explore different PSNEs, and (3) agents stay with the actions that have achieved the best global utility thus far, regardless of regrets. We prove that these three extensions enable the algorithm to select the stable social optimum equilibrium instead of converging to an arbitrary or cyclic equilibrium, as the conventional RM approach does. We demonstrate the effectiveness of our approach on a set of applications in multi-agent distributed control, including a large-scale resource allocation game and a hard combinatorial task assignment problem for which no efficient (polynomial-time) solution is known.
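The three extensions described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm. The 2-player payoff table, the `eps0 / t` exploration schedule, the simplified regret-matching rule (sampling in proportion to positive cumulative regret, uniform when none is positive), and every name in the code are hypothetical choices made only for illustration. The sketch also assumes that information exchange lets each agent evaluate the global utility of its own counterfactual actions.

```python
import random

# Hypothetical 2-player coordination game (not from the paper): both players
# choose 0 or 1; matching on 1 pays more than matching on 0, mismatches pay 0.
PAYOFFS = {
    (0, 0): (1.0, 1.0),
    (0, 1): (0.0, 0.0),
    (1, 0): (0.0, 0.0),
    (1, 1): (2.0, 2.0),
}
N_PLAYERS = 2
N_ACTIONS = 2


def welfare(joint):
    """Global utility: the sum of all players' payoffs for a joint action."""
    return sum(PAYOFFS[tuple(joint)])


def regret_matching_probs(regrets):
    """Distribution proportional to positive regrets; uniform if none is positive."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def run(rounds=2000, eps0=0.5, seed=0):
    """Simulate the sketch; returns (final joint, best joint seen, best welfare)."""
    rng = random.Random(seed)
    regrets = [[0.0] * N_ACTIONS for _ in range(N_PLAYERS)]
    joint = [rng.randrange(N_ACTIONS) for _ in range(N_PLAYERS)]
    best_welfare = welfare(joint)
    best_joint = list(joint)

    for t in range(1, rounds + 1):
        current = welfare(joint)
        # (1) Regrets are accumulated from global utility, not own payoff.
        for i in range(N_PLAYERS):
            for a in range(N_ACTIONS):
                alt = list(joint)
                alt[i] = a
                regrets[i][a] += welfare(alt) - current

        eps = eps0 / t  # (2) small, diminishing exploration probability (assumed schedule)
        nxt = []
        for i in range(N_PLAYERS):
            if rng.random() < eps:
                nxt.append(rng.randrange(N_ACTIONS))  # explore uniformly
            elif current >= best_welfare:
                nxt.append(joint[i])  # (3) stay with the best-so-far action
            else:
                probs = regret_matching_probs(regrets[i])
                nxt.append(rng.choices(range(N_ACTIONS), weights=probs)[0])
        joint = nxt

        # Track the best global utility observed so far.
        if welfare(joint) > best_welfare:
            best_welfare = welfare(joint)
            best_joint = list(joint)

    return tuple(joint), tuple(best_joint), best_welfare
```

Calling `run()` returns the final joint action, the best joint action seen, and its welfare. The sketch makes no convergence claim of its own; the guarantees stated in the abstract apply to the authors' algorithm, not to this simplification.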
