We introduce a reinforcement learning framework for economic design problems. We model the interaction between the designer of the economic environment and the participants as a Stackelberg game: the designer (leader) sets the rules, and the participants (followers) respond strategically. The followers are modeled via no-regret dynamics, which converge to a Bayesian Coarse-Correlated Equilibrium (B-CCE) of the game induced by the leader. Embedding the followers' no-regret dynamics in the leader's learning environment lets us formulate the learning problem as a POMDP, which we call the Stackelberg POMDP. We prove that the optimal policy of the Stackelberg POMDP achieves the same utility as the leader's optimal strategy in the Stackelberg game. We solve the Stackelberg POMDP using an actor-critic method in which the critic can access the joint information of all agents. Finally, we show that we can learn optimal leader strategies in a variety of settings, including scenarios where the leader participates in or designs normal-form games, as well as incomplete-information settings that capture common aspects of indirect mechanism design, such as limited communication and turn-taking play by agents.
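To make the leader-follower structure concrete, the following is a minimal sketch, not the actor-critic method of the paper. It assumes a single follower, complete information, and full-information payoff feedback, and it uses Hedge (multiplicative weights) as one example of a no-regret algorithm. The payoff matrices, the function names `hedge_follower` and `leader_utility`, and the step size `eta` are hypothetical choices made for illustration. With one follower and no private types, the time-averaged play of the no-regret dynamics concentrates on the follower's best response to the leader's commitment, so the leader can be evaluated on the outcome it induces.

```python
import math

# Hypothetical 2x2 game. Rows: leader actions (U, D); columns: follower actions (L, R).
LEADER_PAYOFF = [[1.0, 3.0], [0.0, 2.0]]
FOLLOWER_PAYOFF = [[1.0, 0.0], [0.0, 1.0]]


def hedge_follower(leader_mix, follower_payoff, rounds=2000, eta=0.5):
    """Run Hedge for one follower against a fixed leader mixed strategy.

    Returns the follower's time-averaged mixed strategy.
    """
    n_actions = len(follower_payoff[0])
    # Expected payoff of each follower action under the leader's commitment.
    expected = [
        sum(leader_mix[i] * follower_payoff[i][j] for i in range(len(leader_mix)))
        for j in range(n_actions)
    ]
    cumulative = [0.0] * n_actions
    average = [0.0] * n_actions
    for _ in range(rounds):
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(cumulative)
        weights = [math.exp(eta * (c - top)) for c in cumulative]
        total = sum(weights)
        probs = [w / total for w in weights]
        for j in range(n_actions):
            average[j] += probs[j] / rounds
            cumulative[j] += expected[j]
    return average


def leader_utility(leader_mix, follower_mix, leader_payoff):
    """Expected leader payoff when both sides play the given mixed strategies."""
    return sum(
        leader_mix[i] * follower_mix[j] * leader_payoff[i][j]
        for i in range(len(leader_mix))
        for j in range(len(follower_mix))
    )


if __name__ == "__main__":
    for mix in ([1.0, 0.0], [0.4, 0.6]):
        response = hedge_follower(mix, FOLLOWER_PAYOFF)
        print(mix, response, leader_utility(mix, response, LEADER_PAYOFF))
```

In this hypothetical game, committing to a mix that makes R attractive to the follower yields the leader more than committing to the pure action U, which is the kind of comparison a leader policy has to learn from the followers' induced behavior.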