Zero-shot coordination (ZSC) is a new cooperative multi-agent reinforcement learning (MARL) challenge that aims to train an ego agent to work with diverse, unseen partners during deployment. The significant difference between the deployment-time partners' distribution and the training partners' distribution determined by the training algorithm makes ZSC a unique out-of-distribution (OOD) generalization challenge. The potential distribution gap between evaluation and deployment-time partners leads to inadequate evaluation, which is exacerbated by the lack of appropriate evaluation metrics. In this paper, we present ZSC-Eval, the first evaluation toolkit and benchmark for ZSC algorithms. ZSC-Eval consists of: 1) generation of evaluation partner candidates through behavior-preferring rewards to approximate the deployment-time partners' distribution; 2) selection of evaluation partners by Best-Response Diversity (BR-Div); 3) measurement of generalization performance with various evaluation partners via the Best-Response Proximity (BR-Prox) metric. We use ZSC-Eval to benchmark ZSC algorithms in the Overcooked and Google Research Football environments and obtain novel empirical findings. We also conduct a human study of current ZSC algorithms to verify ZSC-Eval's consistency with human evaluation. ZSC-Eval is available at https://github.com/sjtu-marl/ZSC-Eval.