Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec-POMDP reasoning, where agents use history to infer hidden states and coordinate based on local information. Yet it remains unclear whether popular benchmarks actually demand this reasoning or permit success via simpler strategies. We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence. These findings suggest that some widely used benchmarks may not adequately test core Dec-POMDP assumptions under current training paradigms, potentially leading to over-optimistic assessments of progress. We release our diagnostic tooling to support more rigorous environment design and evaluation in cooperative MARL.
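As a concrete illustration of the kind of information-theoretic probe the abstract describes, the sketch below contrasts the mutual information between two agents' simultaneous actions (synchronous action coupling) with the mutual information between one agent's action and the other agent's lagged action (temporal influence). This is a minimal sketch under stated assumptions, not the released diagnostic tooling: the plug-in estimator choice and all names (`plug_in_mi`, `coupling_probe`) are illustrative.

```python
# Minimal sketch of one information-theoretic coupling probe.
# Assumption: actions are discrete and recorded per timestep for each agent.
import numpy as np


def plug_in_mi(x: np.ndarray, y: np.ndarray) -> float:
    """Empirical (plug-in) mutual information, in nats, between two
    equal-length sequences of discrete actions."""
    n = len(x)
    joint, px, py = {}, {}, {}
    for xi, yi in zip(x, y):
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
        px[xi] = px.get(xi, 0) + 1
        py[yi] = py.get(yi, 0) + 1
    mi = 0.0
    for (xi, yi), c in joint.items():
        # p(x,y) * log[ p(x,y) / (p(x) p(y)) ] with empirical frequencies
        mi += (c / n) * np.log(c * n / (px[xi] * py[yi]))
    return max(mi, 0.0)


def coupling_probe(a_i: np.ndarray, a_j: np.ndarray, lag: int = 1):
    """Return (synchronous MI, lagged MI) for one pair of action traces.

    High synchronous MI alongside near-zero lagged MI is the signature of
    brittle same-step action coupling rather than robust temporal influence.
    """
    sync = plug_in_mi(a_i, a_j)
    lagged = plug_in_mi(a_i[:-lag], a_j[lag:])
    return sync, lagged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy traces: agent j copies agent i's action at the same timestep,
    # so the coupling is purely synchronous, not temporal.
    a_i = rng.integers(0, 5, size=10_000)
    a_j = a_i.copy()
    print(coupling_probe(a_i, a_j))  # sync MI near log 5; lagged MI near 0
```

A probe of this shape, applied to rollouts of trained IPPO/MAPPO policies and paired with the reactive-versus-memory performance comparisons, is one plausible way to operationalise the distinction the abstract draws between synchronous coupling and temporal influence.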