Progress in machine learning and adversarial planning has benefited significantly from benchmark domains, from checkers and the classic UCI data sets to Go and Diplomacy. In sequential decision-making, agent evaluation has largely been restricted to a few interactions against experts, with the aim of reaching some desired level of performance (e.g., beating a human professional player). We propose a benchmark for multiagent learning based on repeated play of the simple game Rock, Paper, Scissors, along with a population of forty-three tournament entries, some of which are intentionally sub-optimal. We describe metrics that measure the quality of agents based on both average returns and exploitability. We then show that several RL, online learning, and language model approaches can learn good counter-strategies and generalize well, but ultimately lose to the top-performing bots, creating an opportunity for research in multiagent learning.
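As a rough illustration of population-based evaluation, the following minimal sketch plays an agent against a small population of bots in repeated Rock, Paper, Scissors and reports two numbers: the average return across the population and the worst-case return against any single bot. Everything here is an assumption made for illustration: the three bots, the `always_paper` agent, the function names, and the use of the worst-case return as a simple stand-in for exploitability. None of it is taken from the benchmark itself, and the benchmark's own metric definitions may differ.

```python
# Moves are encoded as 0 = rock, 1 = paper, 2 = scissors.

def payoff(a, b):
    """Return +1 if move a beats move b, -1 if it loses, 0 on a tie."""
    return [0, 1, -1][(a - b) % 3]

# Hypothetical bots: each maps (own history, opponent history) to a move.
def always_rock(my_hist, opp_hist):
    return 0

def copy_last(my_hist, opp_hist):
    # Repeat the opponent's previous move; open with rock.
    return opp_hist[-1] if opp_hist else 0

def beat_last(my_hist, opp_hist):
    # Play the move that beats the opponent's previous move; open with rock.
    return (opp_hist[-1] + 1) % 3 if opp_hist else 0

# Hypothetical agent under evaluation.
def always_paper(my_hist, opp_hist):
    return 1

POPULATION = {
    "always_rock": always_rock,
    "copy_last": copy_last,
    "beat_last": beat_last,
}

def play_match(agent, bot, rounds=1000):
    """Average per-round return of `agent` against `bot`."""
    agent_hist, bot_hist, total = [], [], 0
    for _ in range(rounds):
        a = agent(agent_hist, bot_hist)
        b = bot(bot_hist, agent_hist)
        total += payoff(a, b)
        agent_hist.append(a)
        bot_hist.append(b)
    return total / rounds

def evaluate(agent, population, rounds=1000):
    """Return (average return over the population, worst-case return)."""
    returns = [play_match(agent, bot, rounds) for bot in population.values()]
    return sum(returns) / len(returns), min(returns)

if __name__ == "__main__":
    avg, worst = evaluate(always_paper, POPULATION)
    print(f"average return: {avg:.3f}, worst-case return: {worst:.3f}")
```

The two numbers can disagree: an agent may earn a positive average return by exploiting weak bots while still losing heavily to one particular opponent, which is why looking at both is informative.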