Retrieval-augmented generation (RAG) mitigates hallucination in large language models (LLMs) by using a query pipeline to retrieve relevant external information and grounding responses in the retrieved knowledge. However, optimizing the query pipeline for cancer patient question-answering (CPQA) systems requires tuning several components separately, each with domain-specific considerations. We propose a novel three-aspect optimization approach for the RAG query pipeline in CPQA systems that draws on public biomedical databases such as PubMed and PubMed Central. The optimization covers: (1) document retrieval, through a comparative analysis of NCBI resources and the introduction of Hybrid Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval, by identifying optimal pairings of dense retrievers and rerankers; and (3) semantic representation, by introducing Semantic Enhanced Overlap Segmentation (SEOS) for improved contextual understanding. On a custom dataset of cancer-related questions, the optimized RAG approach improved the answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and by about 3% over a naive RAG setup. This study highlights the importance of domain-specific query optimization in realizing the full potential of RAG and provides a robust framework for building more accurate and reliable CPQA systems, advancing the development of RAG-based biomedical systems.
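To make the idea of overlap-based segmentation concrete, the following is a minimal sketch of a generic sliding-window splitter that groups sentences into passages sharing a fixed number of sentences with their neighbors. It is not the SEOS method itself, whose details are not given in this abstract; the function names `split_sentences` and `overlap_segments`, the regex-based sentence splitter, and the default window sizes are all assumptions chosen for illustration.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (an assumption): break after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def overlap_segments(text: str, size: int = 3, overlap: int = 1) -> List[str]:
    """Group sentences into windows of `size`, each sharing `overlap` sentences
    with the previous window. Generic illustration only, not the paper's SEOS."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    sentences = split_sentences(text)
    step = size - overlap  # how far the window advances each time
    segments: List[str] = []
    for start in range(0, len(sentences), step):
        segments.append(" ".join(sentences[start:start + size]))
        # Stop once the current window already reaches the last sentence.
        if start + size >= len(sentences):
            break
    return segments


if __name__ == "__main__":
    sample = "A one. B two. C three. D four. E five."
    for segment in overlap_segments(sample, size=3, overlap=1):
        print(segment)
```

The shared sentences mean that a fact spanning a segment boundary still appears intact in at least one passage, which is the usual motivation for overlapping chunks before dense retrieval and reranking.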