Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews

Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr Błaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H. Torr, Jakob Foerster, Scott A. Hale, Thomas Rawson, Anne Cori, Elizaveta Semenova, Adam Mahdi

Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert-annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference annotations were derived from peer-reviewed studies on WHO priority pathogens and produced by domain experts. The harness evaluates each review stage as a separate unit with dedicated metrics, enabling targeted failure analysis. We evaluated five frontier reasoning models and found that no single model dominated across all tasks, revealing sub-task specialisation that aggregate benchmarks often hide. Structured data extraction is a major bottleneck, with no model exceeding an average field-level F1 of 0.67. Estimated costs vary substantially, by up to a factor of 96 across the evaluated models. Documented failure modes suggest that the evaluated models are not yet reliable enough for unsupervised deployment in epidemiology, where findings can inform public policy.
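
The abstract reports an average field-level F1 for structured data extraction but does not define it here. As a rough illustration only, the sketch below shows one common way such a score could be computed: exact match per field after light normalisation, F1 per field across paired records, then an unweighted mean over fields. The function names, the matching rule, and the example records are all assumptions made for illustration, not the AgentSLR implementation or its data.

```python
from typing import Dict, Mapping, Optional, Sequence

# Hypothetical record type: field name -> extracted value (None if absent).
Record = Mapping[str, Optional[str]]


def _norm(value: Optional[str]) -> Optional[str]:
    """Lowercase and collapse whitespace; empty strings count as missing."""
    return None if value is None else " ".join(str(value).lower().split()) or None


def field_level_f1(predicted: Sequence[Record],
                   reference: Sequence[Record]) -> Dict[str, float]:
    """Compute one F1 score per field over paired predicted/reference records.

    Assumed matching rule: a predicted value is a true positive only if it
    equals the reference value after normalisation. A wrong value counts as
    both a false positive and a false negative.
    """
    fields = sorted({f for rec in list(predicted) + list(reference) for f in rec})
    scores: Dict[str, float] = {}
    for field in fields:
        tp = fp = fn = 0
        for pred, ref in zip(predicted, reference):
            p, r = _norm(pred.get(field)), _norm(ref.get(field))
            if p is not None and p == r:
                tp += 1
            else:
                if p is not None:
                    fp += 1
                if r is not None:
                    fn += 1
        denom = 2 * tp + fp + fn
        scores[field] = 2 * tp / denom if denom else 0.0
    return scores


def macro_average(scores: Mapping[str, float]) -> float:
    """Unweighted mean of the per-field scores."""
    return sum(scores.values()) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Made-up toy records, not drawn from the dataset.
    predicted = [{"pathogen": "Ebola", "r0": "1.8"},
                 {"pathogen": "Lassa", "r0": None}]
    reference = [{"pathogen": "ebola", "r0": "1.9"},
                 {"pathogen": "Lassa", "r0": "1.2"}]
    per_field = field_level_f1(predicted, reference)
    print(per_field, macro_average(per_field))
```

Scoring each field separately, rather than scoring whole records, makes it possible to see which extracted fields drive a low average, which is the kind of targeted failure analysis the abstract describes.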

