GAIA: a benchmark for General AI Assistants

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency more generally. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which targets tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining the answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.

