Entity retrieval plays a crucial role in the utilization of Electronic Health Records (EHRs) and is applied across a wide range of clinical practices. However, a comprehensive evaluation of this task has been lacking due to the absence of a public benchmark. In this paper, we develop and release a novel benchmark for evaluating entity retrieval in EHRs, with a particular focus on the semantic gap issue. Using discharge summaries from the MIMIC-III dataset, we take the ICD codes and prescription labels associated with each note as queries, and annotate relevance judgments using GPT-4. In total, we use 1,000 patient notes, generate 1,246 queries, and provide over 77,000 relevance annotations. To offer the first assessment of the semantic gap, we introduce a novel classification scheme for relevance matches. Leveraging GPT-4, we categorize each relevant query–entity pair into one of five categories: string, synonym, abbreviation, hyponym, and implication. Using the proposed benchmark, we evaluate several retrieval methods, including BM25, query expansion, and state-of-the-art dense retrievers. Our findings show that BM25 provides a strong baseline but struggles with semantic matches. Query expansion significantly improves performance, though it slightly degrades string matching. Dense retrievers outperform traditional methods, particularly on semantic matches, and general-domain dense retrievers often surpass those trained specifically on the biomedical domain.
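As context for the BM25 baseline discussed above, the following is a minimal, self-contained sketch of Okapi BM25 scoring in plain Python. It is not the paper's implementation; the toy corpus, tokenization, and parameter values (k1=1.5, b=0.75) are illustrative assumptions. It also shows why BM25 struggles with semantic matches: a document mentioning only a synonym of the query term receives no score.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    query_tokens: list of query terms.
    docs_tokens:  list of documents, each a list of tokens.
    Returns one score per document (higher = more relevant).
    """
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency of each term (number of docs containing it).
    df = Counter()
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue  # no lexical overlap -> term contributes nothing
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy discharge-note fragments (hypothetical examples, not MIMIC-III data).
docs = [
    ["patient", "has", "diabetes", "mellitus"],
    ["hypertension", "noted", "on", "admission"],  # synonym-only docs score 0
    ["diabetes", "type", "two"],
]
scores = bm25_scores(["diabetes"], docs)
```

The second document scores exactly zero even if "hypertension" were clinically related to the query: BM25 only rewards exact token overlap, which is the string-match limitation the benchmark is designed to measure.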