Medical systematic reviews can be costly and resource-intensive. We explore how Large Language Models (LLMs) can support, and be trained to perform, literature screening when provided with a detailed set of selection criteria. Specifically, we instruction-tune LLaMA and Guanaco models to perform abstract screening for medical systematic reviews. Our best model, Bio-SIEVE, outperforms both ChatGPT and trained traditional approaches, and generalises better across medical domains. However, adapting the model to safety-first scenarios remains a challenge. We also explore the impact of multi-task training with Bio-SIEVE-Multi, which adds tasks such as PICO extraction and exclusion reasoning, but find that it is unable to match the performance of single-task Bio-SIEVE. We see Bio-SIEVE as an important step towards specialising LLMs for the biomedical systematic review process and explore opportunities for its future development. We release our models, code and a list of DOIs to reconstruct our dataset for reproducibility.