We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft offers both a text-only and a multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval-Augmented Generation (RAG), as well as an oracle planner and an oracle RAG information extractor, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and strategies on our task and compare their performance to a handcrafted planner. We find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and we offer suggestions on how to improve their capabilities.