Evaluating generalist agents presents significant challenges due to their wide-ranging abilities and the limitations of current benchmarks in assessing true generalization. We introduce the Minecraft Universe (MCU), a fully automated benchmarking framework set within the open-world game Minecraft. MCU dynamically generates and evaluates a broad spectrum of tasks, offering three core components: 1) a task generation mechanism that provides high degrees of freedom and variability, 2) an ever-expanding set of over 3K composable atomic tasks, and 3) a general evaluation framework that supports open-ended task assessment. By integrating large language models (LLMs), MCU dynamically creates diverse environments for each evaluation, fostering agent generalization. The framework uses a vision-language model (VLM) to automatically generate evaluation criteria, achieving over 90% agreement with human ratings across multi-dimensional assessments, which demonstrates that MCU is a scalable and explainable solution for evaluating generalist agents. Additionally, we show that while state-of-the-art foundation models perform well on specific tasks, they often struggle as task diversity and difficulty increase.