Peer incentivization (PI) is a popular multi-agent reinforcement learning approach in which all agents can reward or penalize each other to achieve cooperation in social dilemmas. Despite their potential for scalable cooperation, current PI methods depend heavily on fixed incentive values that must be chosen appropriately with respect to the environmental rewards, which makes them highly sensitive to changes in those rewards. As a result, they fail to maintain cooperation when the environmental rewards change, e.g., due to modified specifications, varying supply and demand, or sensory flaws, even when the conditions for mutual cooperation remain the same. In this paper, we propose Dynamic Reward Incentives for Variable Exchange (DRIVE), an adaptive PI approach to cooperation in social dilemmas with changing rewards. DRIVE agents reciprocally exchange reward differences to incentivize mutual cooperation in a completely decentralized way. We show how DRIVE achieves mutual cooperation in the general Prisoner's Dilemma and empirically evaluate DRIVE in more complex sequential social dilemmas with changing rewards, demonstrating that it achieves and maintains cooperation where current state-of-the-art PI methods do not.
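The sensitivity of fixed incentives to changing rewards can be illustrated with a minimal sketch in the general Prisoner's Dilemma, using the standard payoff ordering T > R > P > S. The sketch is an illustration under stated assumptions, not the DRIVE protocol itself: the class `PrisonersDilemma`, the helper functions, the payoff numbers, and the simplified rule "a defector facing a cooperator is penalized" are all hypothetical. It compares a constant penalty with a penalty equal to the reward difference T - S when all payoffs are rescaled, a change that preserves the payoff ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrisonersDilemma:
    """General Prisoner's Dilemma payoffs (hypothetical helper class)."""
    T: float  # temptation: defect against a cooperator
    R: float  # reward: mutual cooperation
    P: float  # punishment: mutual defection
    S: float  # sucker's payoff: cooperate against a defector

    def is_valid(self) -> bool:
        # Standard ordering that defines a Prisoner's Dilemma.
        return self.T > self.R > self.P > self.S

    def scaled(self, factor: float) -> "PrisonersDilemma":
        # A positive rescaling changes the reward magnitudes
        # but preserves the payoff ordering.
        return PrisonersDilemma(
            *(factor * x for x in (self.T, self.R, self.P, self.S))
        )


def cooperation_stable_fixed(game: PrisonersDilemma, penalty: float) -> bool:
    """Assumed fixed-incentive rule: a defector facing a cooperator
    loses a constant penalty. Cooperation is a best response to
    cooperation if R is at least the penalized temptation payoff."""
    return game.R >= game.T - penalty


def cooperation_stable_difference(game: PrisonersDilemma) -> bool:
    """Assumed adaptive rule (illustration only): the penalty equals
    the reward difference T - S between defector and cooperator."""
    penalty = game.T - game.S
    return game.R >= game.T - penalty


base = PrisonersDilemma(T=5.0, R=3.0, P=1.0, S=0.0)  # illustrative values
for factor in (1.0, 10.0):
    game = base.scaled(factor)
    print(
        f"scale={factor}: valid PD={game.is_valid()}, "
        f"fixed penalty works={cooperation_stable_fixed(game, penalty=3.0)}, "
        f"difference penalty works={cooperation_stable_difference(game)}"
    )
```

Under these assumptions, a constant penalty tuned for the original payoffs stops deterring defection once the rewards are scaled up, while a penalty derived from the reward difference scales with the payoffs and keeps cooperation a best response to cooperation.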