Self-MedRAG: A Self-Reflective Hybrid Retrieval-Augmented Generation Framework for Reliable Medical Question Answering

Large Language Models (LLMs) have demonstrated significant potential in medical Question Answering (QA), yet they remain prone to hallucinations and ungrounded reasoning, limiting their reliability in high-stakes clinical scenarios. While Retrieval-Augmented Generation (RAG) mitigates these issues by incorporating external knowledge, conventional single-shot retrieval often fails to resolve complex biomedical queries requiring multi-step inference. To address this, we propose Self-MedRAG, a self-reflective hybrid framework designed to mimic the iterative hypothesis-verification process of clinical reasoning. Self-MedRAG integrates a hybrid retrieval strategy, combining sparse (BM25) and dense (Contriever) retrievers via Reciprocal Rank Fusion (RRF) to maximize evidence coverage. It employs a generator to produce answers with supporting rationales, which are then assessed by a lightweight self-reflection module using Natural Language Inference (NLI) or LLM-based verification. If the rationale lacks sufficient evidentiary support, the system autonomously reformulates the query and iterates to refine the context. We evaluated Self-MedRAG on the MedQA and PubMedQA benchmarks. The results demonstrate that our hybrid retrieval approach significantly outperforms single-retriever baselines. Furthermore, the inclusion of the self-reflective loop yielded substantial gains, increasing accuracy on MedQA from 80.00% to 83.33% and on PubMedQA from 69.10% to 79.82%. These findings confirm that integrating hybrid retrieval with iterative, evidence-based self-reflection effectively reduces unsupported claims and enhances the clinical reliability of LLM-based systems.
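
The retrieval and reflection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. The RRF constant `k = 60` is a commonly used default and is not stated in the abstract. The function names, the callables (`retrieve_sparse`, `retrieve_dense`, `generate`, `support_score`, `reformulate`), the `threshold`, and `max_rounds` are all hypothetical placeholders standing in for BM25, Contriever, the generator, the NLI or LLM verifier, and the query rewriter.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs with RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k = 60 is an assumed, commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def self_reflective_answer(
    question: str,
    retrieve_sparse: Callable[[str], List[str]],       # stand-in for BM25
    retrieve_dense: Callable[[str], List[str]],        # stand-in for Contriever
    generate: Callable[[str, List[str]], Tuple[str, str]],
    support_score: Callable[[str, List[str]], float],  # stand-in for NLI/LLM check
    reformulate: Callable[[str, str], str],
    threshold: float = 0.5,                            # hypothetical cutoff
    max_rounds: int = 3,                               # hypothetical budget
) -> Tuple[str, str, int]:
    """Retrieve, generate, verify; rewrite the query while support is too low."""
    query = question
    answer, rationale, rounds = "", "", 0
    for rounds in range(1, max_rounds + 1):
        fused = reciprocal_rank_fusion(
            [retrieve_sparse(query), retrieve_dense(query)]
        )
        context = [doc_id for doc_id, _ in fused]
        answer, rationale = generate(question, context)
        if support_score(rationale, context) >= threshold:
            break  # rationale is sufficiently supported by the evidence
        query = reformulate(question, rationale)
    return answer, rationale, rounds
```

For example, fusing the rankings `["d1", "d2", "d3"]` and `["d3", "d1", "d4"]` puts `d1` first, since it ranks near the top of both lists, followed by `d3`, `d2`, and `d4`. Because RRF uses only rank positions, it does not require BM25 and dense similarity scores to be on a comparable scale.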
