This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs, there is still a lack of understanding of how to design LLM-based automation tools and how to evaluate their performance robustly. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. First, we used GPT-4 to automatically extract study characteristics from studies in the human clinical, animal, and human social science domains, using two studies from each domain for prompt development and ten for evaluation. Second, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts from the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for human social science studies). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and interventions/controls were extracted with high accuracy (>80%), whereas outcomes proved more challenging. Evaluation was done manually; automated scoring methods such as BLEU and ROUGE were of limited value. We also observed variability in the LLM's predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results suggest that there might be value in using LLMs, for example as second or third reviewers, but caution is advised when integrating models such as GPT-4 into such tools. Further research on stability and reliability in practical settings is warranted for each type of data processed by the LLM.
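To make the extraction-and-evaluation workflow concrete, the sketch below shows one way an LLM-based extraction step and an automated overlap score might be implemented. It is a minimal sketch only, assuming the OpenAI Python SDK (>= 1.0) and the `rouge-score` package; the prompt wording, the `study_text` argument, and the listed characteristics are illustrative assumptions, not the exact prompts or fields used in the study.

```python
# Minimal sketch: GPT-4 extraction of study characteristics plus a ROUGE-L
# comparison against a human reference. Assumes the OpenAI Python SDK (>=1.0)
# and the `rouge-score` package; prompt and fields are illustrative only.
from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_characteristics(study_text: str) -> str:
    """Ask GPT-4 to pull selected study characteristics from a study text."""
    prompt = (
        "Extract the following characteristics from the study below as JSON: "
        "population, intervention, comparator, outcomes, study design.\n\n"
        f"Study:\n{study_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduces, but does not eliminate, run-to-run variability
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def rouge_l(reference: str, prediction: str) -> float:
    """Score a model answer against a human-extracted reference."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, prediction)["rougeL"].fmeasure
```

As noted above, overlap metrics such as BLEU and ROUGE proved of limited value for this task, so manual comparison against the human reference extractions remained the primary evaluation method.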