PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan

From arXiv, 30 pages, 14 figures

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
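To make the idea of a hierarchical rubric concrete, the sketch below models a rubric as a tree whose leaves are the individually gradable tasks and whose root yields a single replication score. The abstract does not state how sub-task grades are combined, so the weighted-average aggregation, the `RubricNode` class, the `weight` field, and the example task names are all assumptions made for illustration and are not taken from the PaperBench codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (not the PaperBench schema)."""
    name: str
    weight: float = 1.0               # assumed: relative importance among siblings
    passed: Optional[bool] = None     # set only on leaves, e.g. by a judge
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Leaf: 1.0 if passed else 0.0. Inner node: weighted mean of children (assumed rule)."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total_weight


def count_leaves(node: RubricNode) -> int:
    """Number of individually gradable tasks under this node."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


# Illustrative rubric with made-up task names and weights.
rubric = RubricNode("replicate paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("implements method", passed=True),
        RubricNode("implements baseline", passed=False),
    ]),
    RubricNode("experiments executed", weight=1.0, passed=False),
    RubricNode("results match paper", weight=1.0, passed=False),
])

print(count_leaves(rubric))  # 4
print(score(rubric))         # 0.25
```

Under this assumed scheme, partial progress earns partial credit: passing one of the two code-development leaves contributes half of that branch's weight to the root score, even though no experiment was run.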

