A central challenge in multi-agent reinforcement learning is enabling agents to adapt to previously unseen teammates in a zero-shot fashion. Prior work in zero-shot coordination often follows a two-stage process: first generating a diverse training pool of partner agents, and then training a best-response agent to collaborate effectively with the entire pool. While many previous works have achieved strong performance by devising better ways to diversify the partner pool, less attention has been paid to how this pool is leveraged to build an adaptive agent. One limitation is that the best-response agent may converge to a static, generalist policy that performs reasonably well across diverse teammates, rather than learning adaptive, specialist policies that tailor behavior to individual teammates and achieve higher synergy. To address this, we propose an adaptive ensemble agent that uses Theory-of-Mind-based best-response selection: it first infers its teammate's intentions and then selects the most suitable policy from a policy ensemble. We conduct experiments in the Overcooked environment to evaluate zero-shot coordination performance under both fully and partially observable settings. The empirical results demonstrate the superiority of our method over a single best-response baseline.
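The selection mechanism described above can be illustrated with a minimal sketch. All names here (`AdaptiveEnsembleAgent`, `partner_models`, `best_responses`) are hypothetical and not taken from the paper; the sketch assumes a Bayesian-style belief over which training-pool partner type the current teammate resembles, updated from observed teammate actions, with the matching best-response policy chosen from the ensemble.

```python
# Hypothetical sketch of Theory-of-Mind-based best-response selection:
# maintain a belief over teammate types, update it from observed teammate
# actions, and act with the best-response policy for the most likely type.
import numpy as np

class AdaptiveEnsembleAgent:
    def __init__(self, partner_models, best_responses):
        # partner_models[k](state) -> predicted action distribution for type k
        # best_responses[k](state) -> ego action when paired with type k
        self.partner_models = partner_models
        self.best_responses = best_responses
        self.log_belief = np.zeros(len(partner_models))  # uniform prior

    def observe(self, state, teammate_action):
        # Bayesian update: P(type | action) ∝ P(action | type) * P(type)
        for k, model in enumerate(self.partner_models):
            probs = model(state)
            self.log_belief[k] += np.log(probs[teammate_action] + 1e-8)
        self.log_belief -= self.log_belief.max()  # numerical stability

    def act(self, state):
        # Select the ensemble member matched to the most likely teammate type
        k = int(np.argmax(self.log_belief))
        return self.best_responses[k](state)
```

In practice the belief update would run online during an episode, so the agent can switch ensemble members as evidence about the teammate accumulates; under partial observability, the `state` passed to the partner models would be the ego agent's observation rather than the full environment state.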