Systematic literature reviews (SLRs) are a demanding, high-stakes form of scientific knowledge synthesis, yet they remain underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert-annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference annotations were produced by domain experts and derived from peer-reviewed studies on WHO priority pathogens. The harness evaluates each review stage as a separate unit with dedicated metrics, enabling targeted failure analysis. We evaluated five frontier reasoning models and found that no single model dominated across all tasks, revealing sub-task specialisation that aggregate benchmarks often hide. Structured data extraction is a major bottleneck: no model exceeded an average field-level F1 of 0.67. Estimated costs vary by a factor of up to 96 across the evaluated models. The documented failure modes suggest that the evaluated models are not yet reliable enough for unsupervised deployment in epidemiology, where findings can inform public policy.
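To make the "average field-level F1" figure concrete, the following is a minimal sketch of one way such a score could be computed. It is not the AgentSLR implementation: the exact-match rule after normalisation, the macro-average over fields, and the field names `pathogen` and `r0` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# One extracted record: field name -> value (None means "not reported").
Record = Dict[str, Optional[str]]


def _norm(value: Optional[str]) -> Optional[str]:
    """Lowercase and strip a value; treat empty strings as missing."""
    if value is None:
        return None
    text = str(value).strip().lower()
    return text or None


def field_f1(preds: List[Record], refs: List[Record], field: str) -> float:
    """F1 for a single field over aligned prediction/reference records.

    Assumed matching rule: a prediction counts as a true positive only
    if it is present and equals the reference after normalisation.
    """
    if len(preds) != len(refs):
        raise ValueError("preds and refs must be aligned and equal in length")
    tp = fp = fn = 0
    for pred, ref in zip(preds, refs):
        p, r = _norm(pred.get(field)), _norm(ref.get(field))
        if p is not None and p == r:
            tp += 1
        else:
            if p is not None:
                fp += 1  # predicted something wrong or spurious
            if r is not None:
                fn += 1  # reference value was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_field_f1(preds: List[Record], refs: List[Record]) -> float:
    """Macro-average of per-field F1 over all fields in the references."""
    fields = sorted({name for rec in refs for name in rec})
    if not fields:
        return 0.0
    return sum(field_f1(preds, refs, f) for f in fields) / len(fields)


# Hypothetical records, for illustration only.
refs = [{"pathogen": "Ebola", "r0": "1.8"}, {"pathogen": "Lassa", "r0": None}]
preds = [{"pathogen": "ebola", "r0": "2.0"}, {"pathogen": "Lassa", "r0": "1.2"}]
print(average_field_f1(preds, refs))  # 0.5
```

In this toy case the `pathogen` field is matched in both records (F1 of 1.0), while `r0` is wrong once and spurious once (F1 of 0.0), giving a macro-average of 0.5. Scoring each field separately is what allows an aggregate number to be traced back to the specific fields a model fails on.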