Objective: To assess how accurately and efficiently the OpenAI GPT API identifies relevant titles and abstracts in real-world clinical review datasets, and to compare its performance against ground-truth labelling by two independent human reviewers.

Methods: We introduce a novel workflow that uses the OpenAI GPT API to screen titles and abstracts for clinical reviews. A Python script calls the GPT API with the screening criteria expressed in natural language and a corpus of title and abstract datasets that had already been screened by at least two human reviewers. We compared the model's decisions against the human reviewers' decisions across six review papers, screening over 24,000 titles and abstracts.

Results: The workflow achieved an accuracy of 0.91, a sensitivity for excluded papers of 0.91, and a sensitivity for included papers of 0.76. On a randomly selected subset of papers, the GPT API provided reasoning for its decisions, and for a subset of incorrect classifications it corrected its initial decision when asked to explain its reasoning.

Conclusion: The GPT API has the potential to streamline the clinical review process, save researchers time and effort, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, the GPT API can enhance efficiency and lead to more accurate and reliable conclusions in medical research.
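The screening and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' script: the prompt wording, the function names (`build_prompt`, `parse_decision`, `screen`, `screening_metrics`), and the record fields `title` and `abstract` are all assumptions made for illustration. The model call is passed in as a plain callable (`ask_model`), so no particular GPT API signature is assumed. The metrics treat inclusion as the positive class, so "sensitivity of excluded papers" is what is usually called specificity.

```python
from typing import Callable, Dict, List, Optional, Sequence


def build_prompt(criteria: str, title: str, abstract: str) -> str:
    """Combine natural-language screening criteria with one record.

    The wording of this prompt is an illustrative assumption.
    """
    return (
        "You are screening studies for a clinical review.\n"
        f"Screening criteria:\n{criteria}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )


def parse_decision(reply: str) -> bool:
    """Return True for include, False for exclude.

    Design assumption: anything that is not a clear EXCLUDE is kept,
    so ambiguous replies still reach a human reviewer.
    """
    return not reply.strip().upper().startswith("EXCLUDE")


def screen(
    records: Sequence[Dict[str, str]],
    criteria: str,
    ask_model: Callable[[str], str],
) -> List[bool]:
    """Screen each record; `ask_model` is a hypothetical wrapper
    that sends a prompt to the model and returns its text reply."""
    return [
        parse_decision(ask_model(build_prompt(criteria, r["title"], r["abstract"])))
        for r in records
    ]


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Divide safely; None when the class is absent from the data."""
    return numerator / denominator if denominator else None


def screening_metrics(
    human: Sequence[bool], model: Sequence[bool]
) -> Dict[str, Optional[float]]:
    """Compare model decisions with human labels (True = included)."""
    if len(human) != len(model):
        raise ValueError("human and model label lists differ in length")
    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h and m)          # both include
    tn = sum(1 for h, m in pairs if not h and not m)  # both exclude
    fp = sum(1 for h, m in pairs if not h and m)      # model wrongly includes
    fn = sum(1 for h, m in pairs if h and not m)      # model wrongly excludes
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity_included": _ratio(tp, tp + fn),
        "sensitivity_excluded": _ratio(tn, tn + fp),
    }
```

In use, `ask_model` would wrap whichever API client is available, and `screening_metrics(human_labels, screen(records, criteria, ask_model))` would yield the three figures reported in the Results. Defaulting unclear replies to inclusion favours recall over workload reduction, which suits a tool meant to aid rather than replace reviewers.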