CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Zachary S. Siegel,Sayash Kapoor,Nitya Nadgir,Benedikt Stroebl,Arvind Narayanan

Source: arXiv. The benchmark harness and code are available at http://github.com/siegelz/core-bench

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
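The task count and the parallel evaluation described in the abstract can be illustrated with a minimal sketch. It assumes each of the 90 papers yields one task per difficulty level (90 × 3 = 270) and that tasks can be scored independently. The level names, the `Task` class, and the `run_agent` stub are hypothetical placeholders, not the actual CORE-Bench harness, and the stub's outcomes are arbitrary rather than reported results.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical level labels; the abstract only says there are three levels.
LEVELS = ("easy", "medium", "hard")


@dataclass(frozen=True)
class Task:
    paper_id: int
    level: str


def build_tasks(num_papers: int = 90) -> list[Task]:
    """One task per (paper, level) pair: 90 papers x 3 levels = 270 tasks."""
    return [Task(p, lvl) for p in range(num_papers) for lvl in LEVELS]


def run_agent(task: Task) -> bool:
    """Placeholder for running an agent on one task and checking its answer.

    The outcome here is arbitrary (even paper ids "pass") so the sketch is
    deterministic; a real harness would launch the agent in an isolated
    environment and compare its reported values with the ground truth.
    """
    return task.paper_id % 2 == 0


def evaluate(tasks: list[Task], max_workers: int = 8) -> dict[str, float]:
    """Score independent tasks concurrently and return accuracy per level."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(run_agent, tasks))  # preserves task order

    counts: dict[str, tuple[int, int]] = {}
    for task, ok in zip(tasks, outcomes):
        passed, total = counts.get(task.level, (0, 0))
        counts[task.level] = (passed + int(ok), total + 1)
    return {lvl: passed / total for lvl, (passed, total) in counts.items()}


if __name__ == "__main__":
    tasks = build_tasks()
    print(len(tasks))
    print(evaluate(tasks))
```

Because the tasks do not depend on each other, wall-clock time is bounded roughly by the slowest task per worker rather than by the sum over all tasks, which is the kind of saving over a sequential run that the abstract refers to.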
