While Retrieval-Augmented Generation (RAG) has been swiftly adopted in scientific and clinical QA systems, the medical domain still lacks a comprehensive evaluation benchmark. To address this gap, we introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, which covers a variety of tasks in both English and Chinese and builds a retrieval corpus from Wikipedia and PubMed. We also develop the MRAG-Toolkit, which facilitates systematic exploration of different RAG components. Our experiments reveal that: (a) RAG enhances LLM reliability across MRAG tasks; (b) the performance of RAG systems is influenced by retrieval approaches, model sizes, and prompting strategies; and (c) while RAG improves usefulness and reasoning quality, LLM responses to long-form questions may become slightly less readable. We will release the MRAG-Bench dataset and toolkit under the CC BY 4.0 license upon acceptance, to facilitate applications in both academia and industry.