We introduce Unsupervised Partner Design (UPD), a population-free multi-agent reinforcement learning method for robust ad-hoc teamwork. UPD generates training partners on the fly and selects them adaptively according to a learnability criterion, removing the need for pre-trained partner populations or manual parameter tuning. We show that this simple mechanism yields effective partner diversity and extends to joint partner-environment selection when a procedural level generator is available. Across Level-Based Foraging, Overcooked-AI, and the Overcooked Generalisation Challenge, UPD consistently performs strongly relative to both population-based and population-free baselines. In a human-AI user study, agents trained with UPD achieve higher returns and are rated as more adaptive, more human-like, and less frustrating than all evaluated baseline methods.