Question answering for life science research, a domain characterized by rapid discovery, evolving insights, and complex interactions among knowledge entities, poses unique challenges for maintaining a comprehensive knowledge warehouse and retrieving information accurately. To address these issues, we introduce BioRAG, a novel Retrieval-Augmented Generation (RAG) framework built on Large Language Models (LLMs). Our approach begins by parsing, indexing, and segmenting an extensive collection of 22 million scientific papers as the basic knowledge base, followed by training a specialized embedding model tailored to this domain. Additionally, we enhance vector retrieval by incorporating a domain-specific knowledge hierarchy, which helps model the intricate interrelationships between each query and its context. For queries requiring the most current information, BioRAG decomposes the question and employs an iterative retrieval process, incorporating a search engine for step-by-step reasoning. Rigorous experiments demonstrate that our model outperforms fine-tuned LLMs, LLMs with search engines, and other scientific RAG frameworks across multiple life science question-answering tasks.
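The query decomposition and iterative retrieval described above can be sketched as follows. This is a minimal illustration, not the published BioRAG implementation: the helper names (`decompose`, `retrieve`, `iterative_answer`), the semicolon-based decomposition, and the keyword-overlap retriever are all simplifying assumptions standing in for the LLM-driven decomposition and embedding-based, hierarchy-guided retrieval the abstract describes.

```python
# Hypothetical sketch of an iterative retrieve-and-reason loop in the
# spirit of BioRAG. All names and logic here are illustrative stand-ins.

def decompose(question):
    # Placeholder decomposition: in BioRAG, an LLM would break the
    # query into sub-questions; here we split on semicolons.
    return [q.strip() for q in question.split(";") if q.strip()]

def retrieve(sub_question, corpus):
    # Placeholder for embedding + knowledge-hierarchy vector retrieval:
    # here, naive keyword overlap against an in-memory corpus.
    words = set(sub_question.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

def iterative_answer(question, corpus, max_steps=3):
    # Retrieve context for each sub-question in turn; a real system
    # would interleave LLM reasoning and live search-engine calls.
    context = []
    for step, sub_q in enumerate(decompose(question)):
        if step >= max_steps:
            break
        context.extend(retrieve(sub_q, corpus))
    # An LLM would synthesize the final answer from this context.
    return context

corpus = ["CRISPR enables targeted genome editing.",
          "Cas9 is guided by RNA to cut DNA."]
ctx = iterative_answer("define CRISPR; how does Cas9 work", corpus)
print(len(ctx))  # each sub-question retrieves its matching passage
```

The loop structure, rather than the toy retriever, is the point: each sub-question contributes its own retrieved context before the final answer is composed.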