Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. One approach, Policy-Space Response Oracles (PSRO), uses standard reinforcement learning to compute approximate best-response policies and combines them via meta-strategy selection. We augment PSRO with a novel search procedure that uses generative sampling of world states, and we introduce two new meta-strategy solvers based on the Nash bargaining solution. We evaluate PSRO's ability to compute approximate Nash equilibria and its performance in two negotiation games: Colored Trails and Deal or No Deal. We conduct behavioral studies in which human participants negotiate with our agents ($N = 346$). We find that search with generative modeling finds stronger policies at both training time and test time, enables online Bayesian co-player prediction, and can produce agents that, when negotiating with humans, achieve social welfare comparable to that of humans trading among themselves.
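To make the Nash bargaining solution mentioned above concrete, the following is a minimal sketch for a two-player setting with a finite set of candidate outcomes. It selects the outcome that maximizes the product of the players' gains over a disagreement point. The function name `nash_bargaining`, the payoff tuples, and the zero disagreement point are illustrative assumptions; this is not the meta-strategy solver used in the paper, which operates over strategies in the empirical game rather than a hand-written list of outcomes.

```python
from typing import Sequence, Tuple

Payoff = Tuple[float, float]


def nash_bargaining(
    outcomes: Sequence[Payoff],
    disagreement: Payoff = (0.0, 0.0),
) -> Payoff:
    """Pick the outcome maximizing the Nash product (u1 - d1) * (u2 - d2).

    Hypothetical helper for illustration only: `outcomes` is a finite list
    of (player 1 utility, player 2 utility) pairs, and `disagreement` is
    what each player receives if no deal is reached.
    """
    d1, d2 = disagreement
    # Only outcomes that leave neither player worse off than no deal qualify.
    feasible = [(u1, u2) for u1, u2 in outcomes if u1 >= d1 and u2 >= d2]
    if not feasible:
        return disagreement
    return max(feasible, key=lambda u: (u[0] - d1) * (u[1] - d2))


# Four candidate deals; the Nash products are 0, 30, 16, and 0.
deals = [(10, 0), (6, 5), (4, 4), (0, 10)]
print(nash_bargaining(deals))  # (6, 5)
```

In this toy example the selected deal also has the highest total utility, but that is a property of the chosen numbers, not of the Nash product in general: the product criterion trades off total gain against how evenly the gain is split.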