We introduce a reinforcement learning framework for economic design in which the interaction between the environment designer and the participants is modeled as a Stackelberg game. In this game, the designer (leader) sets the rules of the economic system, and the participants (followers) respond strategically. We integrate algorithms for determining the followers' response strategies into the leader's learning environment, formulating the leader's learning problem as a POMDP that we call the Stackelberg POMDP. We prove that the leader's optimal strategy in the Stackelberg game is the optimal policy in the Stackelberg POMDP under a limited set of possible policies, establishing a connection between solving POMDPs and solving Stackelberg games. We solve this POMDP, restricted to the limited policy set, via the centralized training with decentralized execution framework. For the specific case of followers modeled as no-regret learners, we solve an array of increasingly complex settings, including problems of indirect mechanism design with turn-taking and limited communication among agents. We demonstrate the effectiveness of our training framework through ablation studies. We also give convergence results for no-regret learners to a Bayesian version of coarse-correlated equilibrium, extending known results to settings with correlated types.
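The abstract leaves the followers' no-regret learning rule unspecified. As an illustrative assumption only (not the paper's stated algorithm), a standard choice is the multiplicative-weights (Hedge) update, in which a follower maintains weights over its actions and exponentially downweights actions that incur loss; its average play converges to a no-regret strategy against the leader's fixed rules. A minimal sketch:

```python
import math

def hedge(loss_sequences, eta=0.5):
    """Multiplicative-weights (Hedge) no-regret learner.

    loss_sequences: one list of per-action losses (in [0, 1]) per round,
    e.g. the losses a follower observes under the leader's chosen rules.
    Returns the final probability distribution over actions.
    """
    n_actions = len(loss_sequences[0])
    weights = [1.0] * n_actions
    for losses in loss_sequences:
        # Exponentially penalize each action by its observed loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(weights)
    return [w / total for w in weights]

# Action 0 consistently incurs the lowest loss, so after 20 rounds
# the learner's distribution concentrates on it.
rounds = [[0.1, 0.9, 0.8]] * 20
dist = hedge(rounds)
```

In the Stackelberg POMDP view, an inner loop of such follower updates is what the leader's environment simulates before the leader's policy is evaluated; the hyperparameter `eta` and the loss model here are placeholders for illustration.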