Systematic reviews are vital for guiding practice, research, and policy, yet they are often slow and labour-intensive. Large language models (LLMs) could speed up and automate systematic reviews, but their performance on such tasks has not been comprehensively evaluated against humans, and no study has tested GPT-4, the largest LLM to date. This pre-registered study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction across various literature types and languages using a 'human-out-of-the-loop' approach. Although GPT-4's accuracy was on par with human performance in most tasks, the results were skewed by chance agreement and dataset imbalance. After adjusting for these, performance was moderate for data extraction and, barring studies that used highly reliable prompts, ranged from none to moderate for screening across stages and languages. When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect'. With highly reliable prompts, penalising GPT-4 for missing key studies improved its performance further. Our findings indicate that substantial caution is currently warranted when LLMs are used to conduct systematic reviews, but suggest that, for certain systematic review tasks delivered under reliable prompts, LLMs can rival human performance.
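The point about chance agreement and dataset imbalance can be illustrated with a small sketch. It assumes a chance-corrected agreement statistic such as Cohen's kappa; the abstract does not name the statistic the study used, and the function names and screening counts below are hypothetical, chosen only to show how raw accuracy can look strong on an imbalanced screening set while chance-corrected agreement stays low.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of records where the model and the human reviewer agree."""
    total = tp + fp + fn + tn
    return (tp + tn) / total


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa for a 2x2 include/exclude table (model vs. human)."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Agreement expected by chance, from each rater's marginal rates.
    model_yes = (tp + fp) / total
    human_yes = (tp + fn) / total
    expected = model_yes * human_yes + (1 - model_yes) * (1 - human_yes)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study: 1000 abstracts,
    # of which the human reviewer included 50.
    tp, fp, fn, tn = 10, 10, 40, 940
    print(f"accuracy = {accuracy(tp, fp, fn, tn):.3f}")
    print(f"kappa    = {cohen_kappa(tp, fp, fn, tn):.3f}")
```

With these illustrative counts the model agrees with the human on 95% of records, yet kappa is only about 0.26, because a screener that excluded nearly everything would already agree most of the time on a set where 95% of records are excluded.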