Systematic literature reviews (SLRs) are a cornerstone of academic research, yet they are labour-intensive and time-consuming because of the detailed literature curation they require. Generative AI and large language models (LLMs) promise to ease this process by assisting researchers with several tedious tasks, among them the generation of effective Boolean queries that select the publications to consider for inclusion in a review. This paper presents an extensive study of Boolean query generation with LLMs for systematic reviews, reproducing and extending the work of Wang et al. and Alaniz et al. We investigate the replicability and reliability of results obtained with ChatGPT and compare its performance with open-source alternatives such as Mistral and Zephyr, providing a more comprehensive analysis of LLMs for query generation. To this end, we implemented a pipeline that automatically creates a Boolean query for a given review topic using a chosen LLM, retrieves all documents matching this query from the PubMed database, and evaluates the results. With this pipeline we first assess whether the results obtained using ChatGPT for query generation are reproducible and consistent. We then generalize our findings by evaluating open-source models and their efficacy in generating Boolean queries. Finally, we conduct a failure analysis to identify and discuss the limitations and shortcomings of using LLMs for Boolean query generation; this examination exposes gaps and potential areas for improvement in applying LLMs to information retrieval tasks. Our findings highlight the strengths, limitations, and potential of LLMs in the domain of information retrieval and literature review automation.
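The three pipeline stages described above (query generation, PubMed retrieval, evaluation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_query` is a hypothetical placeholder for whichever LLM is prompted (ChatGPT, Mistral, or Zephyr), the retrieval stage assumes NCBI's public E-utilities `esearch` endpoint, and the evaluation computes set-based precision, recall, and F-measure against a gold standard of included studies.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def generate_query(topic: str) -> str:
    """Placeholder for the LLM stage: prompt a model (ChatGPT, Mistral,
    or Zephyr) to turn a review topic into a Boolean query. The prompt
    and model choice are study parameters, not fixed here."""
    raise NotImplementedError("plug in an LLM call")


def search_pubmed(query: str, retmax: int = 10000) -> list:
    """Retrieve PMIDs matching a Boolean query via NCBI E-utilities."""
    url = EUTILS + "?" + urllib.parse.urlencode(
        {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    )
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]


def evaluate(retrieved, relevant, beta: float = 1.0) -> dict:
    """Set-based precision, recall, and F-beta of the retrieved PMIDs
    against a gold standard (e.g. the studies the original review included)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    denom = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f": f}
```

In a reproducibility study like this one, `generate_query` would be called repeatedly per topic so that run-to-run variance of the LLM output, and hence of precision and recall, can be measured.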