Systematic reviews are vital for guiding practice, research, and policy, yet they are often slow and labour-intensive. Large language models (LLMs) could offer a way to speed up and automate systematic reviews, but their performance in such tasks has not been comprehensively evaluated against humans, and no study has tested GPT-4, the largest LLM to date. This pre-registered study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction across various literature types and languages using a 'human-out-of-the-loop' approach. Although GPT-4's accuracy was on par with human performance in most tasks, results were skewed by chance agreement and dataset imbalance. After adjusting for these, performance was moderate for data extraction, and - barring studies that used highly reliable prompts - screening performance ranged from none to moderate across stages and languages. When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect.' Under highly reliable prompts, penalising GPT-4 for missing key studies improved its performance further. Our findings indicate that substantial caution is currently warranted when LLMs are used to conduct systematic reviews, but suggest that, for certain systematic review tasks delivered under reliable prompts, LLMs can rival human performance.
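
To illustrate why raw accuracy can mislead on an imbalanced screening set, the following minimal sketch compares accuracy with a chance-corrected agreement score. The abstract does not name the statistic used, so Cohen's kappa is an assumption here, and the screening decisions are invented toy data rather than results from the study.

```python
import math

def accuracy(human, model):
    """Share of records on which the two raters made the same decision."""
    return sum(h == m for h, m in zip(human, model)) / len(human)

def cohen_kappa(human, model):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Assumed here as the chance-corrected statistic; the abstract does not
    specify which measure the study used.
    """
    n = len(human)
    p_observed = accuracy(human, model)
    labels = set(human) | set(model)
    # Expected agreement if both raters labelled independently
    # at their own observed base rates.
    p_expected = sum(
        (human.count(label) / n) * (model.count(label) / n) for label in labels
    )
    if math.isclose(p_expected, 1.0):
        return float("nan")  # undefined when both raters use a single label
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical toy data: 100 records, of which the human reviewer includes 5.
# The model excludes everything (1 = include, 0 = exclude).
human_decisions = [1] * 5 + [0] * 95
model_decisions = [0] * 100

acc = accuracy(human_decisions, model_decisions)
kappa = cohen_kappa(human_decisions, model_decisions)
print(f"accuracy = {acc:.2f}, kappa = {kappa:.2f}")
```

In this toy case the model agrees with the human on 95% of records while identifying none of the relevant studies, and the chance-corrected score falls to zero. This is the kind of gap between accuracy and adjusted performance that the abstract describes.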