We introduce hybrid execution in multi-agent reinforcement learning (MARL), a new paradigm in which agents complete cooperative tasks under arbitrary communication levels at execution time by exploiting whatever information the agents are able to share. Under hybrid execution, the communication level can range from no communication between agents (fully decentralized) to full communication (fully centralized), and the agents do not know beforehand which level they will encounter at execution time. To formalize this setting, we define a new class of multi-agent partially observable Markov decision processes (POMDPs), which we name hybrid-POMDPs, that explicitly model a communication process between the agents. We contribute MARO, an approach that uses an auto-regressive predictive model, trained in a centralized manner, to estimate other agents' missing observations at execution time. We evaluate MARO on standard scenarios and on extensions of previous benchmarks tailored to emphasize the negative impact of partial observability in MARL. Experimental results show that our method consistently outperforms relevant baselines, allowing agents to act under faulty communication while still exploiting the information that is shared.
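The hybrid execution setting described above can be illustrated with a minimal sketch. It is not the MARO implementation. The function names, the symmetric Bernoulli communication model with level `p`, and the use of each agent's previous estimate as a stand-in for the learned auto-regressive predictor are all assumptions made for illustration only.

```python
import numpy as np

def sample_comm_matrix(n_agents, p, rng):
    """Sample a symmetric boolean matrix (hypothetical communication model).

    Entry (i, j) is True when agents i and j can share observations.
    p = 0 gives the fully decentralized case, p = 1 the fully centralized one.
    """
    upper = np.triu(rng.random((n_agents, n_agents)) < p, k=1)
    comm = upper | upper.T
    np.fill_diagonal(comm, True)  # an agent always has its own observation
    return comm

def hybrid_views(obs, comm, prev_estimate):
    """Build each agent's view of the joint observation.

    obs:           (n_agents, obs_dim) true observations
    comm:          (n_agents, n_agents) boolean communication matrix
    prev_estimate: (n_agents, n_agents, obs_dim) fallback estimates; here a
                   placeholder for the output of a learned predictive model
    Returns an array where views[i, j] is obs[j] if agent i received it,
    and prev_estimate[i, j] otherwise.
    """
    received = comm[:, :, None]
    return np.where(received, obs[None, :, :], prev_estimate)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.normal(size=(3, 2))
    estimates = np.zeros((3, 3, 2))
    comm = sample_comm_matrix(3, p=0.5, rng=rng)
    print(hybrid_views(obs, comm, estimates).shape)
```

In this sketch the communication level `p` is unknown to the agents, so any policy consuming `hybrid_views` has to cope with an arbitrary mix of received and estimated observations.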