The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews

Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes 67 weeks on average, so automating any step of the SR process could reduce the effort involved. Our objective is to investigate whether Large Language Models (LLMs) can accelerate title-abstract screening in two ways: by simplifying abstracts for human screeners, and by automating the screening itself. We ran an experiment in which humans screened the titles and abstracts of 20 papers from a prior SR, using both original and simplified abstracts. We then reproduced the experiment with the GPT-3.5 and GPT-4 LLMs performing the same screening tasks. We also studied whether different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs. Lastly, we studied whether redesigning the prompt used in the LLM reproduction leads to improved performance. Text simplification did not improve the human screeners' performance, but it reduced the time they spent on screening. Screeners' scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that GPT-4 performs better than its predecessor, GPT-3.5, and that Few-shot and One-shot prompting outperform Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. More research is needed before the use of LLMs in the screening process of SRs can be recommended. We recommend that future SR studies publish replication packages with screening data to enable more conclusive experiments with LLM screening.
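
To make the four prompting techniques concrete, the following is a minimal sketch of how ZS, OS, FS, and FS-CoT screening prompts could be assembled. It is not the prompt design used in the study: the screening criterion, the example papers, and the names `Example`, `CRITERIA`, `EXAMPLES`, and `build_prompt` are all hypothetical, and no LLM is called.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """A previously screened paper used as an in-context example (hypothetical data)."""
    title: str
    abstract: str
    decision: str        # "include" or "exclude"
    reasoning: str = ""  # only shown to the model in FS-CoT prompts


# Hypothetical inclusion criterion; a real SR would use its own protocol.
CRITERIA = "Include the paper only if it reports an empirical study in software engineering."

# Hypothetical labelled examples, invented for illustration.
EXAMPLES = [
    Example("A survey of code review practices", "We surveyed 200 developers...",
            "include", "It reports empirical data collected from practitioners."),
    Example("Thoughts on agile", "This essay reflects on agile values...",
            "exclude", "It is an opinion piece without empirical data."),
    Example("Test flakiness in CI", "We analysed 5,000 builds...",
            "include", "It analyses real build data empirically."),
]


def format_paper(title: str, abstract: str) -> str:
    return f"Title: {title}\nAbstract: {abstract}"


def build_prompt(technique: str, title: str, abstract: str, examples=()) -> str:
    """Assemble a screening prompt for one paper.

    ZS uses no examples, OS uses one, FS uses all of them, and FS-CoT
    uses all of them together with the reasoning behind each decision.
    """
    shots = {"ZS": 0, "OS": 1, "FS": len(examples), "FS-CoT": len(examples)}
    if technique not in shots:
        raise ValueError(f"unknown technique: {technique}")

    parts = [f"Screening criteria: {CRITERIA}"]
    for ex in examples[:shots[technique]]:
        block = format_paper(ex.title, ex.abstract)
        if technique == "FS-CoT":
            block += f"\nReasoning: {ex.reasoning}"
        block += f"\nDecision: {ex.decision}"
        parts.append(block)

    # The paper to be screened comes last; FS-CoT asks for reasoning first.
    target = format_paper(title, abstract)
    target += "\nReasoning:" if technique == "FS-CoT" else "\nDecision:"
    parts.append(target)
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("FS-CoT", "Mutation testing at scale",
                       "We evaluate mutation testing on 30 projects...", EXAMPLES))
```

The only difference between the techniques in this sketch is how many labelled examples precede the target paper and whether each example carries its reasoning, which is the distinction the abstract draws between ZS, OS, FS, and FS-CoT.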
