Learning anticipation is a reasoning paradigm in multi-agent reinforcement learning in which agents, while learning, take into account the anticipated learning of other agents. Substantial research has examined the role of learning anticipation in improving cooperation among self-interested agents in general-sum games. Two primary examples are Learning with Opponent-Learning Awareness (LOLA), which anticipates and shapes the opponent's learning process to ensure cooperation among self-interested agents in games such as the iterated prisoner's dilemma, and Look-Ahead (LA), which uses learning anticipation to guarantee convergence in games with cyclic behaviors. So far, the effectiveness of learning anticipation in fully-cooperative games has not been explored. In this study, we investigate the influence of learning anticipation on coordination among common-interest agents. We first show that both LOLA and LA, when applied to fully-cooperative games, degrade coordination among agents, causing worst-case outcomes. To overcome this miscoordination, we propose Hierarchical Learning Anticipation (HLA), in which agents anticipate the learning of other agents in a hierarchical fashion. Specifically, HLA assigns agents to several hierarchy levels to properly regulate their reasoning. Our theoretical and empirical findings confirm that HLA can significantly improve coordination among common-interest agents in fully-cooperative normal-form games. To the best of our knowledge, HLA is the first approach to unlock the benefits of learning anticipation for fully-cooperative games.
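To make the idea of learning anticipation concrete, the following is a minimal sketch of one naive gradient step and one look-ahead-style step in a two-player, two-action common-payoff game. It is an illustration under stated assumptions, not the implementation studied in this work: the function names, the probability parameterization (`p`, `q`), the step sizes (`lr`, `la_lr`), and the clipping to [0, 1] are all hypothetical choices. The anticipated step is treated as a constant, as look-ahead is commonly described; a LOLA-style update would additionally differentiate through it. HLA itself is not sketched, since its hierarchy-assignment details are not given here.

```python
import numpy as np

# Assumed setup: each agent plays action 0 with probability p (agent 1) or q (agent 2),
# and both agents receive the same payoff V(p, q) = [p, 1-p] @ R @ [q, 1-q].

def grad_p(R, q):
    """Partial derivative of V with respect to p, evaluated at opponent policy q."""
    return float(np.array([1.0, -1.0]) @ R @ np.array([q, 1.0 - q]))

def grad_q(R, p):
    """Partial derivative of V with respect to q, evaluated at opponent policy p."""
    return float(np.array([p, 1.0 - p]) @ R @ np.array([1.0, -1.0]))

def naive_step(R, p, q, lr=0.1):
    """Each agent ascends its own gradient, treating the other agent as fixed."""
    p_new = np.clip(p + lr * grad_p(R, q), 0.0, 1.0)
    q_new = np.clip(q + lr * grad_q(R, p), 0.0, 1.0)
    return float(p_new), float(q_new)

def look_ahead_step(R, p, q, lr=0.1, la_lr=0.1):
    """Each agent first predicts the other's next policy, then ascends its own
    gradient evaluated at that predicted policy (no gradient flows through the
    prediction in this sketch)."""
    q_hat = float(np.clip(q + la_lr * grad_q(R, p), 0.0, 1.0))  # agent 1's prediction of agent 2
    p_hat = float(np.clip(p + la_lr * grad_p(R, q), 0.0, 1.0))  # agent 2's prediction of agent 1
    p_new = np.clip(p + lr * grad_p(R, q_hat), 0.0, 1.0)
    q_new = np.clip(q + lr * grad_q(R, p_hat), 0.0, 1.0)
    return float(p_new), float(q_new)

if __name__ == "__main__":
    R = np.eye(2)  # hypothetical coordination game: payoff 1 when actions match
    print("naive:     ", naive_step(R, 0.6, 0.6))
    print("look-ahead:", look_ahead_step(R, 0.6, 0.6))
```

Iterating either step from different starting points and payoff matrices is one simple way to compare how the two update rules behave in a given fully-cooperative game.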