OpenGuanDan: A Large-Scale Imperfect Information Game Benchmark

The advancement of data-driven artificial intelligence (AI), particularly machine learning, depends heavily on large-scale benchmarks. Despite remarkable progress in recent decades across domains ranging from pattern recognition to intelligent decision-making, exemplified by breakthroughs in board games, card games, and esports, there remains a pressing need for more challenging benchmarks to drive further research. To this end, this paper proposes OpenGuanDan, a novel benchmark that enables efficient simulation of GuanDan (a popular four-player, multi-round Chinese card game) and comprehensive evaluation of both learning-based and rule-based GuanDan AI agents. OpenGuanDan poses a suite of nontrivial challenges: imperfect information, large information-set and action spaces, a mixed learning objective involving both cooperation and competition, long-horizon decision-making, variable action spaces, and dynamic team composition. These characteristics make it a demanding testbed for existing intelligent decision-making methods. Moreover, each player is served by an independent API, which allows human-AI interaction and supports integration with large language models. Empirically, we conduct two types of evaluation: (1) pairwise competitions among all GuanDan AI agents, and (2) human-AI matchups. Experimental results demonstrate that while current learning-based agents substantially outperform their rule-based counterparts, they still fall short of superhuman performance, underscoring the need for continued research in multi-agent intelligent decision-making. The project is publicly available at https://github.com/GameAI-NJUPT/OpenGuanDan.