Retrieval-augmented generation (RAG) has emerged as a promising approach to enhancing the performance of large language models (LLMs) on knowledge-intensive tasks such as those in the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements for four medical QA datasets to test LLMs' ability to handle these specific scenarios. Using MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.