Large language models (LLMs) excel at tasks requiring the processing and interpretation of input text. Abstract screening is a labour-intensive component of systematic review, involving the repetitive application of inclusion and exclusion criteria to a large volume of studies identified by a literature search. Here, LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude 3.5 Sonnet) were trialled on the systematic reviews in a full issue of the Cochrane Library to evaluate their accuracy in zero-shot binary classification for abstract screening. Trials on a subset of 800 records identified optimal prompting strategies and demonstrated superior performance of LLMs over human researchers in sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). The best-performing LLM-prompt combinations were then trialled across every replicated search result (n = 119,691) and exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096). Sixty-six LLM-human and LLM-LLM ensembles achieved perfect sensitivity with a maximal precision of 0.458, with smaller performance drops observed in the larger trials. Significant variation in performance was observed between reviews, highlighting the importance of domain-specific validation before deployment. LLMs may reduce the human labour cost of systematic review while maintaining or improving accuracy and sensitivity. Systematic review is the foundation of evidence-based medicine, and LLMs can increase the efficiency and quality of this mode of research.
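The metrics reported above, and the sensitivity-over-precision trade-off that OR-style ensembling exploits, can be made concrete with a minimal sketch. The confusion-matrix counts and screener decisions below are illustrative placeholders, not the study's actual data, and the function names are hypothetical:

```python
# Screening metrics from confusion-matrix counts. tp/fp/tn/fn are
# counts of true positives, false positives, true negatives, and
# false negatives against the human gold-standard screening decision.

def screening_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)        # share of truly relevant records retained
    precision = tp / (tp + fp)          # share of retained records that are relevant
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, precision, balanced_accuracy

# An OR-ensemble retains a record if ANY screener (LLM or human)
# includes it: sensitivity can only rise, while precision may fall,
# which matches the ensemble behaviour described in the abstract.
def or_ensemble(*decisions):
    return [any(votes) for votes in zip(*decisions)]

# Illustrative example: two screeners each miss a different record,
# but the OR-ensemble recovers both.
screener_a = [True, False, True, False]
screener_b = [False, False, True, True]
combined = or_ensemble(screener_a, screener_b)  # [True, False, True, True]
```

Because abstract screening aims above all to avoid missing relevant studies, perfect sensitivity at the cost of lower precision (as in the reported ensembles) is usually the preferred trade-off: false inclusions only cost reviewer time at the full-text stage, while false exclusions are unrecoverable.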