Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether large language models (LLMs) can judge the methodological viability of a research idea before time and computational resources are spent on it. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against the source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than for exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by any single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor.