Multi-agent reinforcement learning (MARL) is a powerful tool for training autonomous agents that act independently in a common environment. However, it can lead to sub-optimal behavior when individual incentives and group incentives diverge. Humans are remarkably adept at resolving such social dilemmas, and replicating this cooperative behavior in selfish agents remains an open problem in MARL. In this work, we draw upon the idea of formal contracting from economics to overcome diverging incentives between agents in MARL. We propose an augmentation to a Markov game in which agents voluntarily agree to binding transfers of reward under pre-specified conditions. Our contributions are theoretical and empirical. First, we show that this augmentation makes all subgame-perfect equilibria of all fully observable Markov games exhibit socially optimal behavior, given a sufficiently rich space of contracts. Next, we show that for general contract spaces, and even under partial observability, richer contract spaces lead to higher welfare. Hence, contract space design solves an exploration-exploitation tradeoff, sidestepping incentive issues. We complement our theoretical analysis with experiments. Issues of exploration in the contracting augmentation are mitigated using a training methodology inspired by multi-objective reinforcement learning: Multi-Objective Contract Augmentation Learning (MOCA). We test our methodology in static, single-move games, as well as in dynamic domains that simulate traffic, pollution management, and common-pool resource management.
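To make the idea of a binding reward transfer concrete, the following is a minimal sketch on a static, single-move game. It is an illustration only, not the paper's formal construction and not MOCA. The payoff values, the contract form (each defecting agent pays a fixed amount to the other), and the helper names `with_contract`, `pure_nash`, and `welfare` are all assumptions chosen for the example.

```python
from itertools import product

ACTIONS = ("C", "D")  # cooperate, defect

# Hypothetical Prisoner's Dilemma rewards: (agent 0 reward, agent 1 reward).
BASE = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def with_contract(payoffs, transfer):
    """Assumed contract: every agent that defects pays `transfer` to the other agent."""
    out = {}
    for (a0, a1), (r0, r1) in payoffs.items():
        pay0 = transfer if a0 == "D" else 0  # amount agent 0 pays out
        pay1 = transfer if a1 == "D" else 0  # amount agent 1 pays out
        out[(a0, a1)] = (r0 - pay0 + pay1, r1 - pay1 + pay0)
    return out


def pure_nash(payoffs):
    """Return all pure-strategy Nash equilibria of a two-player game."""
    equilibria = []
    for a0, a1 in product(ACTIONS, repeat=2):
        r0, r1 = payoffs[(a0, a1)]
        # No unilateral deviation may strictly improve an agent's own reward.
        best0 = all(payoffs[(d, a1)][0] <= r0 for d in ACTIONS)
        best1 = all(payoffs[(a0, d)][1] <= r1 for d in ACTIONS)
        if best0 and best1:
            equilibria.append((a0, a1))
    return equilibria


def welfare(payoffs, profile):
    """Social welfare: the sum of both agents' rewards at an action profile."""
    return sum(payoffs[profile])


contracted = with_contract(BASE, transfer=3)

print(pure_nash(BASE))                    # [('D', 'D')]
print(pure_nash(contracted))              # [('C', 'C')]
print(welfare(BASE, ("D", "D")))          # 2
print(welfare(contracted, ("C", "C")))    # 6
```

In this toy setting the transfer is zero-sum, so it leaves the welfare of every action profile unchanged and only shifts which profile is an equilibrium. Each agent earns 3 at the contracted equilibrium versus 1 at the uncontracted one, so both would plausibly accept the contract voluntarily.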