Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models

Recent proprietary large language models (LLMs), such as GPT-4, have reached a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generation. To address problems that the encoded knowledge of LLMs still cannot handle, various retrieval-augmented generation (RAG) methods have been developed; they search a knowledge corpus for documents and append them, unconditionally or selectively, to the LLM input for generation. However, when existing methods are applied to different domain-specific problems, poor generalization becomes apparent, leading to the retrieval of incorrect documents or to inaccurate judgments. In this paper, we introduce Self-BioRAG, a framework for reliable handling of biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting on generated responses. We train Self-BioRAG on 84k filtered biomedical instruction sets so that it can assess its generated explanations with customized reflective tokens. Our work demonstrates that domain-specific components, such as a retriever, a domain-related document corpus, and instruction sets, are necessary for adhering to domain-related instructions. On three major medical question-answering benchmark datasets, Self-BioRAG achieves significant performance gains, with a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Overall, our analysis indicates that Self-BioRAG finds the clues in the question, retrieves relevant documents if needed, and understands how to answer with information from retrieved documents and encoded knowledge, as a medical expert does. We release our data and code for training our framework components, along with model weights (7B and 13B), to enhance capabilities in biomedical and clinical domains.
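
The control flow summarized in the abstract (retrieve only when needed, generate an answer per retrieved document, then self-assess the candidates) might be sketched roughly as follows. This is a toy illustration under stated assumptions, not the released Self-BioRAG implementation: the word-overlap retriever, the `Candidate` fields, the stub functions `toy_needs_retrieval` and `toy_generate`, and the additive scoring rule are all hypothetical stand-ins for decisions that the trained model and its reflective tokens would make.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    answer: str
    evidence: Optional[str]
    # Hypothetical numeric stand-ins for reflective-token judgments.
    relevance: float  # is the retrieved evidence relevant to the question?
    support: float    # is the answer supported by the evidence?
    utility: float    # is the answer useful overall?


def retrieve(question: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer(
    question: str,
    corpus: List[str],
    needs_retrieval: Callable[[str], bool],
    generate: Callable[[str, Optional[str]], Candidate],
) -> Candidate:
    """Retrieve on demand, generate one candidate per document, keep the best."""
    if not needs_retrieval(question):
        # Answer from encoded knowledge alone.
        return generate(question, None)
    candidates = [generate(question, doc) for doc in retrieve(question, corpus)]
    # Assumed selection rule: sum of the three self-assessment scores.
    return max(candidates, key=lambda c: c.relevance + c.support + c.utility)


# --- Toy stubs so the sketch runs end to end (all hypothetical) ---
CORPUS = [
    "metformin is a first-line drug for type 2 diabetes",
    "aspirin inhibits platelet aggregation",
]


def toy_needs_retrieval(question: str) -> bool:
    # Placeholder rule; a trained model would make this decision.
    return "drug" in question.lower()


def toy_generate(question: str, evidence: Optional[str]) -> Candidate:
    if evidence is None:
        return Candidate("answer from encoded knowledge", None, 0.0, 0.0, 0.5)
    q_words = set(question.lower().split())
    overlap = len(q_words & set(evidence.lower().split()))
    score = overlap / len(q_words)
    return Candidate(f"answer grounded in: {evidence}", evidence, score, score, 0.5)


if __name__ == "__main__":
    best = answer(
        "which drug is first-line for type 2 diabetes",
        CORPUS,
        toy_needs_retrieval,
        toy_generate,
    )
    print(best.answer)
```

Passing `needs_retrieval` and `generate` in as callables keeps the sketch independent of any particular model; in a real system both roles would be filled by the trained language model rather than by hand-written rules.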
