We present \textsc{MineNPC-Task}, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world \emph{Minecraft}. Rather than relying on synthetic prompts, tasks are elicited from formative and summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependency structure, and paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan/act/memory events (including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts) and reports outcomes relative to the total number of attempted subtasks, derived from in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate \textbf{216} subtasks across \textbf{8} experienced players. We observe recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, alongside recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively, while highlighting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and harness to support transparent, reproducible evaluation of future memory-aware embodied agents.