We introduce a novel and extensible benchmark for large language models (LLMs) built on grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku. The open-source game simulation code, available on GitHub, lets LLMs compete against one another and generates detailed data files in JSON, CSV, TXT, and PNG formats for leaderboard rankings and further analysis. We present the results of games among leading LLMs: Claude 3.5 Sonnet and Claude 3 Sonnet by Anthropic, Gemini 1.5 Pro and Gemini 1.5 Flash by Google, GPT-4 Turbo and GPT-4o by OpenAI, and Llama3-70B by Meta. We also encourage submissions of results from other LLMs. In total, we simulated 2,310 matches (5 sessions for each pair among 7 LLMs and a random player) across three types of games, using three distinct prompt types: list, illustration, and image. The results reveal significant variations in LLM performance across games and prompt types; our analysis covers win and disqualification rates, missed opportunities, and invalid moves. The leaderboard and result matrix data are available as open-access data on GitHub. This study enhances our understanding of LLMs' capabilities in playing games they were not specifically trained for, helping to assess their rule comprehension and strategic thinking. On the path to Artificial General Intelligence (AGI), it lays the groundwork for future exploration of LLMs' utility in complex decision-making scenarios and offers directions for further inquiry into the limits of LLMs within game-based frameworks.
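The abstract gives the total of 2,310 matches without a breakdown. The sketch below shows one decomposition that reproduces that figure. It rests on two assumptions that the abstract does not state: every pair plays in both move orders (so pairings are ordered), and one text-only LLM does not take part under the image prompt. The function names are hypothetical and are not taken from the project's code.

```python
def ordered_pairings(players: int) -> int:
    """Number of (first mover, second mover) pairings among distinct players."""
    return players * (players - 1)


def matches(players: int, sessions: int, games: int) -> int:
    """Matches played under a single prompt type."""
    return ordered_pairings(players) * sessions * games


SESSIONS = 5  # from the abstract
GAMES = 3     # Tic-Tac-Toe, Connect Four, Gomoku
PLAYERS = 8   # 7 LLMs plus the random player

# Assumption: the list and illustration prompts are plain text,
# so all 8 players take part in both.
text_prompt_matches = 2 * matches(PLAYERS, SESSIONS, GAMES)

# Assumption: one text-only LLM sits out the image prompt, leaving 7 players.
image_prompt_matches = matches(PLAYERS - 1, SESSIONS, GAMES)

total = text_prompt_matches + image_prompt_matches
print(total)  # 2310
```

Under these assumptions the two text prompts contribute 840 matches each and the image prompt contributes 630. Other breakdowns may exist; the published result files are the authoritative source for the actual pairing scheme.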