Developing agents capable of fluid gameplay in first/third-person games without API access remains a critical challenge on the path to Artificial General Intelligence (AGI). Recent efforts leverage Vision Language Models (VLMs) as direct controllers, frequently pausing the game to analyze screens and plan actions through language reasoning. However, this inefficient paradigm fundamentally restricts agents to basic, non-fluent interactions: relying on isolated VLM reasoning for each action makes it impossible to handle tasks requiring high reactivity (e.g., FPS shooting) or dynamic adaptability (e.g., ACT combat). To address this, we propose a paradigm shift in gameplay agent design: instead of directly controlling gameplay, the VLM develops specialized execution modules tailored to tasks such as shooting and combat. These modules handle real-time game interactions, elevating the VLM to the role of a high-level developer. Building upon this paradigm, we introduce GameSense, a gameplay agent framework in which the VLM develops task-specific game sense modules by observing task execution and leveraging vision tools and neural network training pipelines. These modules encapsulate action-feedback logic, ranging from direct action rules to neural network-based decisions. Experiments demonstrate that our framework is the first to achieve fluent gameplay across diverse genres, including ACT, FPS, and Flappy Bird, setting a new benchmark for game-playing agents.
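The spectrum of module types described above, from direct action rules to neural network-based decisions behind a common action-feedback interface, can be illustrated with a minimal sketch. All class and method names (`GameSenseModule`, `RuleModule`, `NeuralModule`, `act`) are hypothetical placeholders, not the paper's actual API; the observation format and policies are simplified assumptions for illustration only.

```python
import numpy as np

class GameSenseModule:
    """Hypothetical base interface: map an observation to a
    low-latency action without invoking the VLM per frame."""
    def act(self, obs):
        raise NotImplementedError

class RuleModule(GameSenseModule):
    """Direct action rule (illustrative Flappy-Bird-style policy):
    flap when the bird has fallen below the pipe-gap center."""
    def act(self, obs):
        bird_y, gap_y = obs  # assumed: larger y means lower on screen
        return "flap" if bird_y > gap_y else "noop"

class NeuralModule(GameSenseModule):
    """Stand-in for a trained network: a tiny linear policy whose
    weights would come from the VLM-driven training pipeline."""
    def __init__(self, weights, bias):
        self.w = np.asarray(weights, dtype=float)
        self.b = float(bias)
    def act(self, obs):
        score = float(np.dot(self.w, np.asarray(obs, dtype=float)) + self.b)
        return "flap" if score > 0 else "noop"

# Both module types expose the same interface, so the agent can
# dispatch real-time control to whichever the VLM has developed.
rule = RuleModule()
print(rule.act((120.0, 100.0)))  # bird below gap center -> "flap"

neural = NeuralModule(weights=[1.0, -1.0], bias=0.0)
print(neural.act((120.0, 100.0)))  # positive score -> "flap"
```

The key design point is that the VLM authors or trains these modules offline, while the modules themselves run every frame, which is what enables reactivity that per-action VLM reasoning cannot provide.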