The advancement of data-driven artificial intelligence (AI), particularly machine learning, depends heavily on large-scale benchmarks. Despite remarkable progress in recent decades across domains ranging from pattern recognition to intelligent decision-making, exemplified by breakthroughs in board games, card games, and electronic sports games, there remains a pressing need for more challenging benchmarks to drive further research. To this end, this paper proposes OpenGuanDan, a novel benchmark that enables efficient simulation of GuanDan (a popular four-player, multi-round Chinese card game) and comprehensive evaluation of both learning-based and rule-based GuanDan AI agents. OpenGuanDan poses a suite of nontrivial challenges: imperfect information, large information-set and action spaces, a mixed learning objective involving both cooperation and competition, long-horizon decision-making, variable action spaces, and dynamic team composition. These characteristics make it a demanding testbed for existing intelligent decision-making methods. Moreover, each player is exposed through an independent API, which allows human-AI interaction and supports integration with large language models. Empirically, we conduct two types of evaluation: (1) pairwise competitions among all GuanDan AI agents, and (2) human-AI matchups. Experimental results demonstrate that while current learning-based agents substantially outperform their rule-based counterparts, they still fall short of superhuman performance, underscoring the need for continued research in multi-agent intelligent decision-making. The project is publicly available at https://github.com/GameAI-NJUPT/OpenGuanDan.