Cooperative multi-agent reinforcement learning (MARL) is typically formalised as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP), where agents must reason about both the environment and other agents' behaviour. In practice, current model-free MARL algorithms use simple recurrent function approximators to address the challenge of reasoning about other agents from partial information. In this position paper, we argue that the empirical success of these methods is not due to effective Markov signal recovery, but rather to learning simple conventions that bypass environment observations and memory. Through a targeted case study, we show that co-adapting agents can learn brittle conventions, which then fail when they are partnered with non-adaptive agents. Crucially, the same models can learn grounded policies when the task design necessitates it, revealing that the issue is not a fundamental limitation of the learning models but a failure of benchmark design. Our analysis also suggests that modern MARL environments may not adequately test the core assumptions of Dec-POMDPs. We therefore advocate for new cooperative environments built upon two core principles: (1) behaviours grounded in observations and (2) memory-based reasoning about other agents, ensuring that success requires genuine skill rather than fragile, co-adapted agreements.
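For reference, the Dec-POMDP formalism invoked above is standardly written as a tuple (the notation below follows the common textbook convention; symbol names are illustrative):

```latex
\[
\mathcal{M} \;=\; \langle \mathcal{D},\, \mathcal{S},\, \{\mathcal{A}_i\},\, T,\, R,\, \{\Omega_i\},\, O,\, \gamma \rangle
\]
% \mathcal{D}: finite set of agents
% \mathcal{S}: set of environment states
% \mathcal{A}_i: action set of agent i (joint action a = (a_1, \dots, a_{|\mathcal{D}|}))
% T(s' \mid s, a): state transition function
% R(s, a): shared team reward
% \Omega_i: observation set of agent i
% O(o \mid s', a): joint observation function
% \gamma \in [0, 1): discount factor
```

Under this formulation each agent conditions its policy only on its own action-observation history, which is precisely the memory-based reasoning requirement that the environments criticised here fail to enforce.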