Large language models (LLMs) excel at tasks requiring the processing and interpretation of input text. Abstract screening is a labour-intensive component of systematic review, involving the repetitive application of inclusion and exclusion criteria to a large volume of studies identified by a literature search. Here, LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude 3.5 Sonnet) were trialled on the systematic reviews in a full issue of the Cochrane Library to evaluate their accuracy in zero-shot binary classification for abstract screening. Trials on a subset of 800 records identified optimal prompting strategies and demonstrated superior performance of LLMs over human researchers in sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). The best-performing LLM-prompt combinations were then trialled across every replicated search result (n = 119,691) and exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096). Sixty-six LLM-human and LLM-LLM ensembles achieved perfect sensitivity with a maximal precision of 0.458, with smaller performance drops observed in the larger trials. Significant variation in performance was observed between reviews, highlighting the importance of domain-specific validation before deployment. LLMs may reduce the human labour cost of systematic review while maintaining or improving accuracy and sensitivity. Systematic review is the foundation of evidence-based medicine, and LLMs can increase the efficiency and quality of this mode of research.
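The metrics reported above, and the sensitivity-over-precision trade-off that OR-style ensembling exploits, can be made concrete with a minimal sketch. The confusion-matrix counts and screener decisions below are illustrative placeholders, not the study's actual data, and the function names are hypothetical:

```python
# Screening metrics from confusion-matrix counts. tp/fp/tn/fn are
# counts of true positives, false positives, true negatives, and
# false negatives against the human gold-standard screening decision.

def screening_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)        # share of truly relevant records retained
    precision = tp / (tp + fp)          # share of retained records that are relevant
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, precision, balanced_accuracy

# An OR-ensemble retains a record if ANY screener (LLM or human)
# includes it: sensitivity can only rise, while precision may fall,
# which matches the ensemble behaviour described in the abstract.
def or_ensemble(*decisions):
    return [any(votes) for votes in zip(*decisions)]

# Illustrative example: two screeners each miss a different record,
# but the OR-ensemble recovers both.
screener_a = [True, False, True, False]
screener_b = [False, False, True, True]
combined = or_ensemble(screener_a, screener_b)  # [True, False, True, True]
```

Because abstract screening aims above all to avoid missing relevant studies, perfect sensitivity at the cost of lower precision (as in the reported ensembles) is usually the preferred trade-off: false inclusions only cost reviewer time at the full-text stage, while false exclusions are unrecoverable.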