Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluation in this setting must satisfy three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent scores only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.