By organizing knowledge within a research field, Systematic Reviews (SRs) provide valuable leads to steer research. Evidence suggests that SRs have become first-class artifacts in software engineering. However, the tedious manual effort associated with the screening phase of SRs renders these studies a costly and error-prone endeavor. While screening has traditionally been considered not amenable to automation, the advent of generative AI-driven chatbots backed by large language models is set to disrupt the field. In this report, we propose an approach that leverages these novel technological developments to automate the screening of SRs. We assess the consistency, classification performance, and generalizability of ChatGPT in screening articles for SRs and compare these figures with those of traditional classifiers used in SR automation. Our results indicate that ChatGPT is a viable option for automating SR processes, but it requires careful consideration from developers when they integrate ChatGPT into their SR tools.