The lack of high-quality ground-truth datasets for training machine learning (ML) models limits the potential of artificial intelligence (AI) in scientific research. Scientific information extraction (SIE) from the literature using large language models (LLMs) is emerging as a powerful approach to automating the creation of these datasets. However, existing LLM-based approaches and benchmarking studies for SIE focus on broad topics such as biomedicine and chemistry, are limited to choice-based tasks, and extract information only from short, well-formatted text. The potential of SIE methods for complex, open-ended tasks remains considerably under-explored. In this study, we address these research gaps using virology, a domain that has been virtually ignored in SIE. We design a unique, open-ended SIE task: extracting the mutations in a given virus that modify its interaction with the host. We develop VILLA, a new multi-step retrieval-augmented generation (RAG) framework for SIE. In parallel, we curate a novel dataset of 629 mutations in ten influenza A virus proteins, obtained from 239 scientific publications, to serve as ground truth for the mutation extraction task. Finally, we demonstrate VILLA's superior performance through a novel and comprehensive evaluation and a comparison with vanilla RAG and other state-of-the-art RAG- and agent-based tools for SIE.