Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
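To make the contrast between stage-attributed metrics and a single end-to-end score concrete, the following is a minimal sketch and not the benchmark's actual scoring code. It assumes each system exposes a ranked retrieval list and a set of studies kept after screening. The function names, argument formats, and the toy study IDs are all hypothetical. Under these assumptions, end-to-end recall factors into retrieval recall at K times screening recall over the studies that survived retrieval, so a low end-to-end score can be attributed to one stage or the other.

```python
def recall(found, relevant):
    """Fraction of relevant IDs that appear in found; 0.0 if relevant is empty."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(found)) / len(relevant)


def stage_attributed_recall(ranked_ids, screened_ids, included_ids, k=200):
    """Split end-to-end recall into a retrieval term and a screening term.

    ranked_ids   : retrieval output, best first (hypothetical input format)
    screened_ids : IDs the screener kept (only those inside the top-k pool count)
    included_ids : ground-truth included studies for one meta-analysis
    """
    pool = ranked_ids[:k]
    included = set(included_ids)

    # Stage 1: how many included studies made it into the candidate pool.
    retrieval_recall = recall(pool, included)

    # Stage 2: of the included studies that were reachable, how many were kept.
    reachable = set(pool) & included
    kept = set(screened_ids) & set(pool)
    screening_recall = recall(kept, reachable)

    # End-to-end: included studies present in the final kept set.
    end_to_end_recall = recall(kept, included)

    return {
        "retrieval_recall": retrieval_recall,
        "screening_recall": screening_recall,
        "end_to_end_recall": end_to_end_recall,
    }


# Toy example with made-up study IDs.
scores = stage_attributed_recall(
    ranked_ids=["a", "b", "c", "d", "e"],
    screened_ids=["a", "d"],
    included_ids=["a", "b", "c", "x"],
    k=4,
)
print(scores)
```

In the toy example, retrieval reaches three of the four included studies, but screening keeps only one of those three, so most of the end-to-end loss is attributable to screening rather than retrieval.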