Large language models (LLMs) have shown impressive capabilities in generating program code, opening exciting opportunities for applying program synthesis to games. In this work, we explore the potential of LLMs to directly synthesize usable code for a wide range of gaming applications, focusing on two programming languages, Python and Java. We use an evolutionary hill-climbing algorithm in which both the seeds of the initial programs and their subsequent mutations are controlled by LLMs. For Python, the framework covers various game-related tasks, including five miniature versions of Atari games, ten levels of Baba is You, an environment inspired by Asteroids, and a maze-generation task. For Java, the framework contains 12 games from the TAG tabletop games framework. Across these 29 tasks, we evaluated 12 language models for Python and 8 for Java. Our findings suggest that LLM performance depends more on the task than on model size. While larger models generate more executable programs, they do not always produce higher-quality solutions and are much more expensive to run. No model has a clear overall advantage, although one model may outperform the others on a specific task; trying several models on a problem and taking the best result across them is therefore more reliable than relying on a single one.
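The evolutionary loop described above can be sketched as follows. This is a minimal (1+1) hill-climber, not the paper's implementation: the `llm_generate` and `llm_mutate` functions are hypothetical placeholders standing in for real LLM API calls, and the fitness function is illustrative only (a real evaluator would run the candidate program inside the game environment, e.g. an Atari or TAG task, and return its score).

```python
import random

def llm_generate(prompt):
    # Placeholder for an LLM call that produces a seed program.
    # A real system would send `prompt` to a model API and return its code.
    return "def policy(obs):\n    return 0\n"

def llm_mutate(program, rng):
    # Placeholder for an LLM-driven mutation. A real system would ask the
    # model to rewrite `program`; here we append a comment so each call
    # yields a distinct but still syntactically valid variant.
    return program + f"# variant {rng.randint(0, 10**6)}\n"

def evaluate(program):
    # Fitness of a candidate program. Non-executable code scores worst;
    # the length-based score below is purely illustrative.
    try:
        compile(program, "<candidate>", "exec")
    except SyntaxError:
        return float("-inf")
    return len(program)

def hill_climb(task_prompt, iterations=20, seed=0):
    rng = random.Random(seed)
    best = llm_generate(task_prompt)      # LLM supplies the initial seed
    best_score = evaluate(best)
    for _ in range(iterations):
        child = llm_mutate(best, rng)     # LLM proposes a mutation
        score = evaluate(child)
        if score > best_score:            # greedy acceptance: keep improvements
            best, best_score = child, score
    return best, best_score
```

The greedy acceptance rule makes this a hill climber rather than a broader evolutionary algorithm: only one candidate is kept, and a mutation survives only if it strictly improves fitness.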