Retrieval-Augmented Generation (RAG) is a powerful technique for enriching Large Language Models (LLMs) with external knowledge, enabling factually grounded responses, a critical requirement in high-stakes domains such as healthcare. However, the efficacy of RAG systems is fundamentally limited by the performance of their retrieval module, since irrelevant or semantically misaligned documents directly compromise the accuracy of the final generated response. General-purpose dense retrievers can struggle with the nuanced language of specialised domains, while the high accuracy of in-domain models often comes at prohibitive computational cost. In this work, we address this trade-off by developing and evaluating a two-stage retrieval architecture that combines a lightweight ModernBERT bidirectional encoder for efficient initial candidate retrieval with a ColBERTv2 late-interaction model for fine-grained re-ranking. We conduct comprehensive evaluations of both retriever performance and end-to-end RAG performance in the biomedical setting, fine-tuning the IR module on 10k question-passage pairs from PubMedQA. Our analysis of the retriever module confirms the positive impact of the ColBERT re-ranker, which improves Recall@3 by up to 4.2 percentage points over its retrieve-only counterpart. When integrated into the biomedical RAG pipeline, our IR module achieves a state-of-the-art average accuracy of 0.4448 across the five tasks of the MIRAGE question-answering benchmark, outperforming strong baselines such as MedCPT (0.4436). Our ablation studies reveal that this performance depends critically on a joint fine-tuning process that aligns the retriever and re-ranker; without it, the re-ranker can degrade performance.
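To make the two-stage retrieve-then-rerank design concrete, the sketch below illustrates the control flow under simplifying assumptions: it is not the authors' implementation, the embeddings are random placeholders standing in for the fine-tuned ModernBERT (single-vector) and ColBERTv2 (per-token) encoders, and all sizes (embedding dimension, corpus size, candidate count, token counts) are hypothetical.

```python
# Illustrative sketch of a two-stage retrieval pipeline (not the paper's code).
# Stage 1: single-vector bi-encoder retrieval (dot product / cosine similarity).
# Stage 2: ColBERT-style late-interaction (MaxSim) re-ranking of the candidates.
# Random vectors are used as stand-ins for the fine-tuned encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs, k = 128, 1000, 20                         # assumed embedding size, corpus size, candidates kept

# Stage 1: single-vector retrieval (bi-encoder, e.g. ModernBERT in the paper).
doc_vecs = rng.normal(size=(n_docs, dim))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = rng.normal(size=dim)
query_vec /= np.linalg.norm(query_vec)

stage1_scores = doc_vecs @ query_vec                   # cosine similarity to every passage
candidates = np.argsort(-stage1_scores)[:k]            # top-k candidate passages

# Stage 2: late-interaction re-ranking over per-token embeddings.
def maxsim(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Sum over query tokens of the maximum similarity to any document token."""
    sim = q_tokens @ d_tokens.T                        # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

q_tokens = rng.normal(size=(16, dim))                  # placeholder query token embeddings
doc_tokens = {i: rng.normal(size=(180, dim)) for i in candidates}  # placeholder passage token embeddings

reranked = sorted(candidates, key=lambda i: maxsim(q_tokens, doc_tokens[i]), reverse=True)
top3 = reranked[:3]                                    # passages that would be passed to the RAG generator
print(top3)
```

The design choice this sketch highlights is the asymmetry between the stages: the cheap single-vector scores prune the corpus to a small candidate set, and the more expensive token-level MaxSim interaction is applied only to that set, which is what keeps the fine-grained re-ranking computationally affordable.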