Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules, or enforce a strict separation between "naive learners" updating on fast timescales and "meta-learners" observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, which effectively function as learning algorithms on the fast intra-episode timescale. We find that the cooperative mechanism identified in prior work, in which vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models, combined with co-player diversity, provides a scalable path to learning cooperative behaviors.