While Retrieval-Augmented Generation (RAG) has been swiftly adopted in scientific and clinical QA systems, the medical domain still lacks a comprehensive evaluation benchmark. To address this gap, we introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, which covers a variety of tasks in English and Chinese and builds a corpus from Wikipedia and PubMed. Additionally, we develop the MRAG-Toolkit, which facilitates systematic exploration of different RAG components. Our experiments reveal that: (a) RAG enhances LLM reliability across MRAG tasks; (b) the performance of RAG systems is influenced by retrieval approaches, model sizes, and prompting strategies; and (c) while RAG improves usefulness and reasoning quality, LLM responses may become slightly less readable for long-form questions. We will release the MRAG dataset and toolkit under a CC BY 4.0 license upon acceptance, to facilitate applications in both academia and industry.