VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon

Developing AI agents that can robustly adapt to varying strategic landscapes without retraining is a central challenge in multi-agent learning. Pokémon Video Game Championships (VGC) is a domain with a vast space of approximately $10^{139}$ team configurations, far larger than that of other games such as Chess, Go, Poker, StarCraft, or Dota. The combinatorial nature of team building in Pokémon VGC causes optimal strategies to vary substantially depending on both the controlled team and the opponent's team, making generalization uniquely challenging. To advance research on this problem, we introduce VGC-Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies a human-play dataset of over 700,000 battle logs together with a range of baseline agents. These baselines are based on heuristics, large language models, behavior cloning, and multi-agent reinforcement learning with empirical game-theoretic methods such as self-play, fictitious play, and double oracle. In the restricted setting where an agent is trained and evaluated in a mirror match with a single team configuration, our methods can win against a professional VGC competitor. We repeat this training and evaluation with progressively larger team sets and find that, as the number of teams increases, the algorithm that performs best in the single-team setting performs worse and becomes more exploitable, but generalizes better to unseen teams. Our code and dataset are open-sourced at https://github.com/cameronangliss/vgc-bench and https://huggingface.co/datasets/cameronangliss/vgc-battle-logs.
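
To make two of the terms above concrete, the following is a minimal sketch of fictitious play and an exploitability measure on a toy two-player zero-sum matrix game (rock-paper-scissors). It is an illustration under stated assumptions, not the VGC-Bench implementation: the function names `fictitious_play` and `exploitability` are hypothetical, and the benchmark itself applies these ideas to learned policies rather than to a small payoff matrix.

```python
import numpy as np


def fictitious_play(payoff, iterations=20000):
    """Toy fictitious play for a two-player zero-sum matrix game.

    payoff[i, j] is the row player's payoff when row plays i and column
    plays j; the column player receives the negative of that value.
    Returns the empirical (time-averaged) strategies of both players.
    """
    n_rows, n_cols = payoff.shape
    row_counts = np.zeros(n_rows)
    col_counts = np.zeros(n_cols)
    # Arbitrary initial action for each player (an assumption of this sketch).
    row_counts[0] = 1
    col_counts[0] = 1
    for _ in range(iterations):
        # Each player best-responds to the opponent's empirical action counts.
        row_best = np.argmax(payoff @ col_counts)
        col_best = np.argmin(row_counts @ payoff)
        row_counts[row_best] += 1
        col_counts[col_best] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()


def exploitability(payoff, row_strategy, col_strategy):
    """Total gain available to best responders against the two strategies.

    This is zero exactly at a Nash equilibrium of the zero-sum game.
    """
    best_row_value = np.max(payoff @ col_strategy)
    best_col_value = np.min(row_strategy @ payoff)
    return best_row_value - best_col_value


# Rock-paper-scissors payoff matrix for the row player.
rps = np.array([[0.0, -1.0, 1.0],
                [1.0, 0.0, -1.0],
                [-1.0, 1.0, 0.0]])

row_strategy, col_strategy = fictitious_play(rps)
gap = exploitability(rps, row_strategy, col_strategy)
```

In this toy game the empirical strategies approach the uniform mixture and the exploitability approaches zero as the iteration count grows. The sketch is only meant to show what "more exploitable" measures: how much a best-responding opponent can gain against a fixed strategy.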
