Existing communication methods for multi-agent reinforcement learning (MARL) in cooperative multi-robot problems are almost exclusively task-specific, requiring a new communication strategy to be trained for each unique task. We address this inefficiency by introducing a communication strategy applicable to any task within a given environment. We pre-train the communication strategy in a self-supervised manner using a set autoencoder, without task-specific reward guidance. Our objective is to learn a fixed-size latent Markov state from a variable number of agent observations. Under mild assumptions, we prove that policies using our latent representations are guaranteed to converge, and we derive an upper bound on the value error introduced by our Markov state approximation. Our method enables seamless adaptation to novel tasks without fine-tuning the communication strategy, gracefully supports scaling to more agents than were present during training, and detects out-of-distribution events in an environment. Empirical results on diverse MARL scenarios validate the effectiveness of our approach, which surpasses task-specific communication strategies on unseen tasks. Our implementation is available at https://github.com/proroklab/task-agnostic-comms.