Communication is crucial for solving cooperative Multi-Agent Reinforcement Learning tasks in Partially Observable Markov Decision Processes. Existing works often rely on black-box methods to encode local information or features into messages shared with other agents. However, such black-box approaches cannot provide quantitative guarantees on the expected return, and they often generate continuous messages with high communication overhead and poor interpretability. In this paper, we establish an upper bound on the return gap between an ideal policy with full observability and an optimal partially observable policy with discrete communication. This result enables us to recast multi-agent communication as a novel online clustering problem over the local observations at each agent, with messages as cluster labels and the upper bound on the return gap as the clustering loss. By minimizing the upper bound, we propose a surprisingly simple design of message generation functions in multi-agent communication and integrate it with reinforcement learning using a Regularized Information Maximization loss function. Evaluations show that the proposed discrete communication significantly outperforms state-of-the-art multi-agent communication baselines and can achieve nearly optimal returns with few-bit messages that are naturally interpretable.
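To make the clustering view concrete, the following is a minimal sketch of a generic Regularized Information Maximization objective applied to discrete message assignment. It is not the loss used in the paper, which is derived from the return-gap bound and is not specified in this abstract. The function names (`rim_loss`, `messages`), the linear-logit setup, and the L2 penalty weight `reg` are all assumptions made for illustration. The sketch treats each row of `logits` as one agent's scores over K candidate messages for one local observation, and rewards assignments that are confident per observation yet balanced across the batch.

```python
import numpy as np

def softmax(logits):
    # Numerically stable row-wise softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy in nats; eps guards against log(0).
    return -(p * np.log(p + eps)).sum(axis=axis)

def rim_loss(logits, weights=None, reg=0.0):
    # Hypothetical helper: generic RIM objective, not the paper's exact loss.
    probs = softmax(logits)                    # p(message | observation)
    marginal = probs.mean(axis=0)              # empirical p(message)
    h_marginal = entropy(marginal)             # high when messages are balanced
    h_conditional = entropy(probs, axis=1).mean()  # low when assignments are confident
    mutual_info = h_marginal - h_conditional
    # Assumed regularizer: squared L2 norm of the encoder weights.
    penalty = reg * float((weights ** 2).sum()) if weights is not None else 0.0
    return -mutual_info + penalty

def messages(logits):
    # Discrete message = cluster label with the highest score.
    return logits.argmax(axis=1)

if __name__ == "__main__":
    confident = 20.0 * np.eye(4)   # four observations, four distinct messages
    uniform = np.zeros((4, 4))     # no information in the assignment
    print(messages(confident), rim_loss(confident), rim_loss(uniform))
```

With K messages the mutual-information term is at most log K nats, so a 4-message vocabulary (2 bits) can lower the loss to about -log 4 at best. This matches the intuition that few-bit messages carry a bounded amount of information about the local observation.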