Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model each other's learning dynamics. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.