Recent advances in Large Language Models (LLMs) have demonstrated their potential as autonomous agents across various tasks. One emerging application is the use of LLMs to play games. In this work, we explore a practical problem for the gaming industry: Can LLMs be used to measure game difficulty? We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with the difficulty perceived by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process. Based on our experiments, we also outline general principles and guidelines for incorporating LLMs into the game-testing process.