Large Language Models (LLMs) have demonstrated significant potential in medical Question Answering (QA), yet they remain prone to hallucinations and ungrounded reasoning, limiting their reliability in high-stakes clinical scenarios. While Retrieval-Augmented Generation (RAG) mitigates these issues by incorporating external knowledge, conventional single-shot retrieval often fails to resolve complex biomedical queries requiring multi-step inference. To address this, we propose Self-MedRAG, a self-reflective hybrid framework designed to mimic the iterative hypothesis-verification process of clinical reasoning. Self-MedRAG integrates a hybrid retrieval strategy, combining sparse (BM25) and dense (Contriever) retrievers via Reciprocal Rank Fusion (RRF) to maximize evidence coverage. It employs a generator to produce answers with supporting rationales, which are then assessed by a lightweight self-reflection module using Natural Language Inference (NLI) or LLM-based verification. If the rationale lacks sufficient evidentiary support, the system autonomously reformulates the query and iterates to refine the context. We evaluated Self-MedRAG on the MedQA and PubMedQA benchmarks. The results demonstrate that our hybrid retrieval approach significantly outperforms single-retriever baselines. Furthermore, the inclusion of the self-reflective loop yielded substantial gains, increasing accuracy on MedQA from 80.00% to 83.33% and on PubMedQA from 69.10% to 79.82%. These findings confirm that integrating hybrid retrieval with iterative, evidence-based self-reflection effectively reduces unsupported claims and enhances the clinical reliability of LLM-based systems.
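To make the fusion step concrete, below is a minimal sketch of Reciprocal Rank Fusion as commonly defined (score(d) = Σ 1/(k + rank_r(d)) over retrievers r), which is how the abstract describes combining the BM25 and Contriever rankings. The function name, the toy document ids, and the smoothing constant k = 60 (the value from the original RRF paper) are illustrative assumptions, not details taken from Self-MedRAG itself.

```python
# Illustrative sketch of Reciprocal Rank Fusion (RRF); names and k=60 are assumptions.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of document ids into one ranking.

    rankings: list of ranked lists, e.g. one from a sparse retriever (BM25)
              and one from a dense retriever (Contriever).
    k: smoothing constant that damps the influence of top ranks.
    """
    scores = defaultdict(float)
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # standard RRF contribution
    # Higher fused score is better, so sort descending.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage with toy document ids.
bm25_hits = ["d3", "d1", "d7"]    # sparse (BM25) ranking
dense_hits = ["d1", "d7", "d9"]   # dense (Contriever) ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# -> ['d1', 'd7', 'd3', 'd9']; d1 wins because it ranks highly in both lists
```

A practical appeal of RRF here is that it needs only ranks, not scores, so the incommensurable relevance scores of sparse and dense retrievers never have to be calibrated against each other.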