This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs, there is still a lack of understanding of how to design LLM-based automation tools and how to evaluate their performance robustly. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. First, we used GPT-4 to automatically extract study characteristics from studies in the human clinical, animal, and human social science domains, using two studies from each domain for prompt development and ten for evaluation. Second, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts from the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for human social science studies). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and interventions/controls were extracted with high accuracy (>80%), whereas outcomes proved more challenging. Evaluation was done manually; automated scoring methods such as BLEU and ROUGE were of limited value. We also observed variability in the LLM's predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results suggest that there might be value in using LLMs, for example as second or third reviewers, but caution is advised when integrating models such as GPT-4 into such tools. Further research on stability and reliability in practical settings is warranted for each type of data processed by the LLM.
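To make the extraction-and-evaluation workflow concrete, the sketch below shows one way an LLM-based extraction step and an automated overlap score might be implemented. It is a minimal sketch only, assuming the OpenAI Python SDK (>= 1.0) and the `rouge-score` package; the prompt wording, the `study_text` argument, and the listed characteristics are illustrative assumptions, not the exact prompts or fields used in the study.

```python
# Minimal sketch: GPT-4 extraction of study characteristics plus a ROUGE-L
# comparison against a human reference. Assumes the OpenAI Python SDK (>=1.0)
# and the `rouge-score` package; prompt and fields are illustrative only.
from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_characteristics(study_text: str) -> str:
    """Ask GPT-4 to pull selected study characteristics from a study text."""
    prompt = (
        "Extract the following characteristics from the study below as JSON: "
        "population, intervention, comparator, outcomes, study design.\n\n"
        f"Study:\n{study_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduces, but does not eliminate, run-to-run variability
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def rouge_l(reference: str, prediction: str) -> float:
    """Score a model answer against a human-extracted reference."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, prediction)["rougeL"].fmeasure
```

As noted above, overlap metrics such as BLEU and ROUGE proved of limited value for this task, so manual comparison against the human reference extractions remained the primary evaluation method.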