Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec-POMDP reasoning, where agents use history to infer hidden states and coordinate based on local information. Yet it remains unclear whether popular benchmarks actually demand this reasoning or permit success via simpler strategies. We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence. These findings suggest that some widely used benchmarks may not adequately test core Dec-POMDP assumptions under current training paradigms, potentially leading to over-optimistic assessments of progress. We release our diagnostic tooling to support more rigorous environment design and evaluation in cooperative MARL.