SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether large language models (LLMs) can judge the methodological viability of a research idea before time and computational resources are spent on it. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.
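To make the false-positive and false-negative framing concrete, here is a minimal sketch of how the two error rates could be computed from binary soundness labels and model verdicts. It is not the benchmark's own evaluation code: the function name `error_rates`, the binary sound/unsound encoding, and the toy data are all assumptions made for illustration, and the abstract does not state how sub-scores are thresholded.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ErrorRates:
    false_positive_rate: float  # share of low-soundness proposals rated sound
    false_negative_rate: float  # share of sound proposals rated unsound


def error_rates(is_sound: list[bool], predicted_sound: list[bool]) -> ErrorRates:
    """Compute FPR and FNR, treating "sound" as the positive class.

    Hypothetical helper: the binary encoding is an assumption, not
    something specified in the abstract.
    """
    if len(is_sound) != len(predicted_sound):
        raise ValueError("labels and predictions must have the same length")

    pairs = list(zip(is_sound, predicted_sound))
    false_pos = sum(1 for truth, pred in pairs if not truth and pred)
    false_neg = sum(1 for truth, pred in pairs if truth and not pred)
    negatives = sum(1 for truth in is_sound if not truth)
    positives = len(is_sound) - negatives

    return ErrorRates(
        false_positive_rate=false_pos / negatives if negatives else 0.0,
        false_negative_rate=false_neg / positives if positives else 0.0,
    )


if __name__ == "__main__":
    # Made-up toy data, NOT results from the benchmark.
    labels = [True, True, False, False, False, True]
    standard = [True, True, True, True, False, True]        # over-accepts
    aggressive = [False, True, False, False, False, False]  # over-rejects

    print("standard prompt:  ", error_rates(labels, standard))
    print("aggressive prompt:", error_rates(labels, aggressive))
```

In the toy data, the "standard" verdicts err only by accepting unsound proposals, while the "aggressive" verdicts err only by rejecting sound ones. That is the shift from false positives to false negatives described above, shown here on invented values.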

