Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS-game-based environments either impose relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts along both the capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing landscape of RTS-game-based environments and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub (https://github.com/tb6147877/TowerMind).