Generating Teammates for Training Robust Ad Hoc Teamwork Agents via Best-Response Diversity

Ad hoc teamwork (AHT) is the challenge of designing a robust learner agent that effectively collaborates with unknown teammates without prior coordination mechanisms. Early approaches address the AHT challenge by training the learner with a diverse set of handcrafted teammate policies, usually designed based on an expert's domain knowledge about the policies the learner may encounter. However, implementing teammate policies for training based on domain knowledge is not always feasible. In such cases, recent approaches attempt to improve the robustness of the learner by training it with teammate policies generated by optimising information-theoretic diversity metrics. The problem with optimising existing information-theoretic diversity metrics for teammate policy generation is that it produces superficially different teammates. When used for AHT training, superficially different teammate behaviours may not improve a learner's robustness during collaboration with unknown teammates. In this paper, we present an automated teammate policy generation method that optimises the Best-Response Diversity (BRDiv) metric, which measures diversity based on the compatibility of teammate policies in terms of returns. We evaluate our approach in environments with multiple valid coordination strategies, comparing it against methods that optimise information-theoretic diversity metrics and an ablation that does not optimise any diversity metric. Our experiments indicate that optimising BRDiv yields a diverse set of training teammate policies that improve the learner's performance relative to previous teammate generation approaches when collaborating with near-optimal, previously unseen teammate policies.
