Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, a task that often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates how reconstructable that workflow is using structural and content-based scores. Experiments on 213 ReScience C articles, the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date, demonstrate that ARA generalizes and that its workflow reconstruction and assessment remain consistent across LLMs, model temperatures, and scientific domains. ARA achieves approximately 61% accuracy on three benchmarks, and the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and to enable next-generation peer review. Code and data are available at https://github.com/AndresLaverdeMarin/agentic_reproducibility_assessment.
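To make the idea of a workflow graph with a structural score concrete, the following is a minimal sketch in Python. It is not ARA's implementation: the node kinds are taken from the four roles named in the abstract, while the function names, the example graph, and the score itself (the fraction of output nodes that can be traced back to at least one source) are assumptions chosen purely for illustration.

```python
from collections import deque

# Node kinds mirror the four roles named in the abstract; the rest is hypothetical.
KINDS = {"source", "method", "experiment", "output"}


def reachable_from(edges, starts):
    """Return every node reachable from `starts` by following directed edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def structural_score(nodes, edges):
    """Fraction of output nodes traceable back to at least one source node.

    Illustrative stand-in only; the scores ARA actually uses are not
    specified in the abstract.
    """
    for name, kind in nodes.items():
        if kind not in KINDS:
            raise ValueError(f"unknown node kind for {name!r}: {kind!r}")
    sources = [n for n, k in nodes.items() if k == "source"]
    outputs = [n for n, k in nodes.items() if k == "output"]
    if not outputs:
        return 0.0
    traced = reachable_from(edges, sources)
    return sum(1 for o in outputs if o in traced) / len(outputs)


# Hypothetical workflow extracted from a paper: one output has no documented path.
nodes = {
    "dataset": "source",
    "preprocessing": "method",
    "training_run": "experiment",
    "accuracy_table": "output",
    "ablation_figure": "output",
}
edges = [
    ("dataset", "preprocessing"),
    ("preprocessing", "training_run"),
    ("training_run", "accuracy_table"),
]
score = structural_score(nodes, edges)
print(score)  # 0.5
```

In this toy graph only one of the two outputs can be traced back to the dataset, so the score is 0.5. A content-based score, which the abstract also mentions, would additionally need to inspect what each node says and is not sketched here.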