In academic research, systematic literature reviews are foundational yet tedious to create due to the high volume of publications and the labor-intensive processes involved. Systematic selection of relevant papers through conventional means such as keyword-based filtering can be inadequate, plagued by semantic ambiguity and inconsistent terminology, which leads to sub-optimal outcomes. To reduce the extensive manual filtering required, we explore and evaluate the potential of Large Language Models (LLMs) to improve the efficiency, speed, and precision of literature review filtering, cutting the amount of manual screening needed. By using the models as classification agents that act only on a structured database, we mitigate common failure modes of LLMs, such as hallucination. We evaluate the real-world performance of this setup during the construction of a recent literature survey with initially more than 8.3k potentially relevant articles under consideration, and compare it with human performance on the same dataset. Our findings indicate that employing advanced LLMs such as GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, or Llama3 with simple prompting can significantly reduce the time required for literature filtering: from the weeks typically spent on manual screening to only a few minutes. Crucially, we also show that false negatives can be controlled through a consensus scheme, achieving recalls >98.8%, at or even beyond the typical human error threshold, thereby also improving the accuracy and relevance of the selected articles. Our research not only demonstrates a substantial improvement in the methodology of literature reviews but also sets the stage for further integration and extensive future applications of responsible AI in academic research practices.
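The consensus scheme mentioned above can be illustrated with a minimal sketch. The variant shown here, an inclusive-OR vote in which a paper is retained whenever at least one model votes "include", is one natural way to suppress false negatives at the cost of extra false positives; the vote data, model labels, and helper names are illustrative assumptions, not details taken from the paper.

```python
# Sketch of an inclusive-OR consensus filter: a paper survives screening
# if ANY model in the ensemble votes "include". This biases the ensemble
# against false negatives (missed relevant papers), which is what recall
# measures. All names and data below are hypothetical.

def consensus_include(votes: dict) -> bool:
    """Keep the paper if at least one model votes to include it."""
    return any(votes.values())

def recall(kept: set, relevant: set) -> float:
    """Fraction of truly relevant papers that survived filtering."""
    return len(kept & relevant) / len(relevant)

# Toy screening run: three papers, three models.
screen = {
    "paper_a": {"model_1": True,  "model_2": False, "model_3": False},
    "paper_b": {"model_1": False, "model_2": False, "model_3": False},
    "paper_c": {"model_1": True,  "model_2": True,  "model_3": True},
}
kept = {pid for pid, votes in screen.items() if consensus_include(votes)}
# paper_a is kept despite two "exclude" votes; paper_b is filtered out.
print(kept)                                    # {'paper_a', 'paper_c'}
print(recall(kept, {"paper_a", "paper_c"}))    # 1.0
```

A stricter majority-vote rule would shrink the kept set (here it would drop `paper_a`), trading recall for precision; the inclusive-OR rule matches the paper's stated goal of controlling false negatives.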