Systematic literature reviews (SLRs) are a cornerstone of academic research, yet they are labour-intensive and time-consuming because of the detailed literature curation they require. Generative AI and large language models (LLMs) promise to ease this process by assisting researchers with several tedious tasks, among them the generation of effective Boolean queries that select the publications to consider for inclusion in a review. This paper presents an extensive study of Boolean query generation with LLMs for systematic reviews, reproducing and extending the work of Wang et al. and Alaniz et al. We investigate the replicability and reliability of results obtained with ChatGPT and compare its performance with open-source alternatives such as Mistral and Zephyr, providing a more comprehensive analysis of LLMs for query generation. To this end, we implemented a pipeline that automatically creates a Boolean query for a given review topic using a chosen LLM, retrieves all documents matching this query from the PubMed database, and evaluates the results. With this pipeline we first assess whether the results obtained using ChatGPT for query generation are reproducible and consistent. We then generalize our findings by evaluating open-source models and their efficacy in generating Boolean queries. Finally, we conduct a failure analysis to identify and discuss the limitations and shortcomings of using LLMs for Boolean query generation; this examination exposes gaps and potential areas for improvement in applying LLMs to information retrieval tasks. Our findings highlight the strengths, limitations, and potential of LLMs in the domain of information retrieval and literature review automation.
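The three pipeline stages described above (query generation, PubMed retrieval, evaluation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_query` is a hypothetical placeholder for whichever LLM is prompted (ChatGPT, Mistral, or Zephyr), the retrieval stage assumes NCBI's public E-utilities `esearch` endpoint, and the evaluation computes set-based precision, recall, and F-measure against a gold standard of included studies.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def generate_query(topic: str) -> str:
    """Placeholder for the LLM stage: prompt a model (ChatGPT, Mistral,
    or Zephyr) to turn a review topic into a Boolean query. The prompt
    and model choice are study parameters, not fixed here."""
    raise NotImplementedError("plug in an LLM call")


def search_pubmed(query: str, retmax: int = 10000) -> list:
    """Retrieve PMIDs matching a Boolean query via NCBI E-utilities."""
    url = EUTILS + "?" + urllib.parse.urlencode(
        {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    )
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]


def evaluate(retrieved, relevant, beta: float = 1.0) -> dict:
    """Set-based precision, recall, and F-beta of the retrieved PMIDs
    against a gold standard (e.g. the studies the original review included)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    denom = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f": f}
```

In a reproducibility study like this one, `generate_query` would be called repeatedly per topic so that run-to-run variance of the LLM output, and hence of precision and recall, can be measured.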