Real-world multi-agent systems may require ad hoc teaming, where an agent must coordinate with previously unseen teammates to solve a task in a zero-shot manner. Prior work often either selects a pretrained policy based on an inferred model of the new teammates or pretrains a single policy that is robust to potential teammates. Instead, we propose to leverage all pretrained policies in a zero-shot transfer setting. We formalize this problem as an ad hoc multi-agent Markov decision process and present a solution that uses two key ideas, generalized policy improvement and difference rewards, for efficient and effective knowledge transfer between different teams. We empirically demonstrate that our algorithm, Generalized Policy improvement for Ad hoc Teaming (GPAT), successfully enables zero-shot transfer to new teams in three simulated environments: cooperative foraging, predator-prey, and Overcooked. We also demonstrate our algorithm in a real-world multi-robot setting.
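To make the two named ingredients concrete, the sketch below gives their standard forms from the reinforcement learning literature; the notation (the pretrained action-value functions $Q^{\pi_i}$, the global return $G$, the joint action $z$, and the default action $c_i$) is assumed from that broader literature rather than drawn from this abstract, and GPAT's exact instantiation may differ.

```latex
% A sketch of the standard forms only; symbols are assumed, not taken from the paper.
% Generalized policy improvement: act greedily with respect to the best of the
% pretrained policies' action-value functions Q^{\pi_1}, \dots, Q^{\pi_n}.
\[
  \pi_{\mathrm{GPI}}(s) \in \arg\max_{a \in \mathcal{A}} \; \max_{i \in \{1,\dots,n\}} Q^{\pi_i}(s, a)
\]
% Difference reward for agent i: the global return G on the joint action z,
% minus the return when agent i's action is replaced by a default action c_i,
% isolating agent i's marginal contribution.
\[
  D_i(z) \;=\; G(z) \;-\; G\big(z_{-i} \cup \{c_i\}\big)
\]
```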