Large language models (LLMs) have shown impressive capabilities in generating program code, opening exciting opportunities for applying program synthesis to games. In this work, we explore the potential of LLMs to directly synthesize usable code for a wide range of gaming applications, focusing on two programming languages, Python and Java. We use an evolutionary hill-climbing algorithm in which both the seeds of the initial programs and their subsequent mutations are controlled by LLMs. For Python, the framework covers various game-related tasks, including five miniature versions of Atari games, ten levels of Baba is You, an environment inspired by Asteroids, and a maze-generation task. For Java, the framework contains 12 games from the TAG tabletop games framework. Across these 29 tasks, we evaluated 12 language models for Python and 8 for Java. Our findings suggest that LLM performance depends more on the task than on model size. While larger models generate more executable programs, they do not always produce higher-quality solutions and are much more expensive to run. No model has a clear overall advantage, although one model may outperform the others on a specific task; trying several models on a problem and taking the best result across them is therefore more reliable than relying on a single one.
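The evolutionary loop described above can be sketched as follows. This is a minimal (1+1) hill-climber, not the paper's implementation: the `llm_generate` and `llm_mutate` functions are hypothetical placeholders standing in for real LLM API calls, and the fitness function is illustrative only (a real evaluator would run the candidate program inside the game environment, e.g. an Atari or TAG task, and return its score).

```python
import random

def llm_generate(prompt):
    # Placeholder for an LLM call that produces a seed program.
    # A real system would send `prompt` to a model API and return its code.
    return "def policy(obs):\n    return 0\n"

def llm_mutate(program, rng):
    # Placeholder for an LLM-driven mutation. A real system would ask the
    # model to rewrite `program`; here we append a comment so each call
    # yields a distinct but still syntactically valid variant.
    return program + f"# variant {rng.randint(0, 10**6)}\n"

def evaluate(program):
    # Fitness of a candidate program. Non-executable code scores worst;
    # the length-based score below is purely illustrative.
    try:
        compile(program, "<candidate>", "exec")
    except SyntaxError:
        return float("-inf")
    return len(program)

def hill_climb(task_prompt, iterations=20, seed=0):
    rng = random.Random(seed)
    best = llm_generate(task_prompt)      # LLM supplies the initial seed
    best_score = evaluate(best)
    for _ in range(iterations):
        child = llm_mutate(best, rng)     # LLM proposes a mutation
        score = evaluate(child)
        if score > best_score:            # greedy acceptance: keep improvements
            best, best_score = child, score
    return best, best_score
```

The greedy acceptance rule makes this a hill climber rather than a broader evolutionary algorithm: only one candidate is kept, and a mutation survives only if it strictly improves fitness.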