We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92\% vs. 15\% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in, e.g., law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which suggests targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.