Medical information retrieval (MIR) is essential for extracting relevant medical knowledge from diverse sources, including electronic health records, scientific literature, and medical databases. However, achieving effective zero-shot dense retrieval in the medical domain is challenging due to the scarcity of relevance-labeled data. In this paper, we introduce Self-Learning Hypothetical Document Embeddings (SL-HyDE) to address this issue. SL-HyDE leverages large language models (LLMs) to generate hypothetical documents for a given query. These generated documents encapsulate key medical context, guiding a dense retriever toward the most relevant real documents. The self-learning framework progressively refines both pseudo-document generation and retrieval using unlabeled medical corpora, without requiring any relevance-labeled data. Additionally, we present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios, encompassing five tasks and ten datasets. By benchmarking ten models on CMIRB, we establish a rigorous standard for evaluating medical information retrieval systems. Experimental results show that SL-HyDE significantly outperforms existing methods in retrieval accuracy while demonstrating strong generalization and scalability across various LLM and retriever configurations. CMIRB data and evaluation code are publicly available at: https://github.com/CMIRB-benchmark/CMIRB.
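The hypothetical-document retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_hypothetical_doc` and `encode` are toy stand-ins (the LLM generator and dense retriever used by SL-HyDE are not reproduced here), and the self-learning refinement loop is omitted.

```python
import numpy as np

def generate_hypothetical_doc(query: str) -> str:
    # Stand-in for the LLM generator: in HyDE-style retrieval, an LLM
    # writes a pseudo-document that plausibly answers the query.
    return f"A hypothetical medical passage answering: {query}"

def encode(text: str, dim: int = 64) -> np.ndarray:
    # Toy deterministic embedding (hash-seeded random vector) standing in
    # for a trained dense retriever; real systems use a learned encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def hyde_retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    # Embed the generated pseudo-document instead of the raw query,
    # then rank corpus documents by cosine similarity (unit vectors,
    # so a dot product suffices).
    q_vec = encode(generate_hypothetical_doc(query))
    scores = [float(q_vec @ encode(doc)) for doc in corpus]
    order = np.argsort(scores)[::-1][:top_k]
    return [corpus[i] for i in order]

corpus = [
    "Passage on causes of elevated blood glucose in type 2 diabetes.",
    "Passage on first-line treatments for hypertension.",
    "Passage on asthma triggers and inhaler use.",
]
print(hyde_retrieve("What causes high blood sugar?", corpus))
```

With random stand-in embeddings the ranking itself is arbitrary; the point is the pipeline shape: generate, embed the pseudo-document, score against the corpus.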