Evaluating generalist agents presents significant challenges due to their wide-ranging abilities and the limitations of current benchmarks in assessing true generalization. We introduce the Minecraft Universe (MCU), a fully automated benchmarking framework set within the open-world game Minecraft. MCU dynamically generates and evaluates a broad spectrum of tasks, offering three core components: 1) a task generation mechanism that provides high degrees of freedom and variability, 2) an ever-expanding set of over 3K composable atomic tasks, and 3) a general evaluation framework that supports open-ended task assessment. By integrating large language models (LLMs), MCU dynamically creates diverse environments for each evaluation, fostering agent generalization. The framework uses a vision-language model (VLM) to automatically generate evaluation criteria, achieving over 90% agreement with human ratings across multi-dimensional assessments, which demonstrates that MCU is a scalable and explainable solution for evaluating generalist agents. Additionally, we show that while state-of-the-art foundation models perform well on specific tasks, they often struggle as task diversity and difficulty increase.